Abstract
A tutorial providing an introduction to bag-of-words models.
This tutorial builds on what we covered in the previous tutorial, Text Mining and Sentiment Analysis. It provides an introduction to bag-of-words models, the term frequency-inverse document frequency (TF-IDF) concept, and topic modelling. Many of the applied tasks and examples are a continuation of the methods and learnings from that tutorial.
The goal is to continue our learning in the natural language processing landscape, essentially looking at a greater selection of methods for examining text data, manipulating it into different forms, and extracting insight from it. The tutorial is broken into three parts, which flow neatly from one to another. A summary of each part is given below:
- PART 1: introduces bag-of-words models; a representation of text that describes the occurrence of words within a document
- PART 2: covers TF-IDF; a method that reflects how important a word is to a document in a collection or corpus
- PART 3: introduces topic modelling; an unsupervised classification approach for documents
The work in this tutorial is largely based on the work of Ian Durbach; his profile is found here. In fact, much of it follows the Data Science for Industry course at the University of Cape Town. Other great resources used in this tutorial are discussed below.
One of these resources is Chapter 3 of Text Mining with R, which covers tf-idf and related topics. In fact, Text Mining with R by Julia Silge and David Robinson is highly recommended, as this book and its approach are closely followed here.
The code in Part 3, the topic modelling lesson, borrows heavily from Chapter 6 of Text Mining with R, and also draws on the material on converting between tidy and non-tidy text formats in Chapter 5.
The document is for the most part very applied in nature, and doesn’t
assume much natural language processing experience. For programming
purposes, it would be useful if you are familiar with the
tidyverse, or at least dplyr specifically.
It must be stressed that this is only a starting point, a hopefully fun foray into the world of text, not a definitive statement of how you should analyze text. In fact, some of the methods demonstrated would likely be too rudimentary for most real-world goals.
This part of our notebook covers bag-of-words models. These are used in NLP settings where a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
In the notebook we:
In a bag-of-words model, a text document is represented by the set of words used in the document. This is a simplified representation that ignores much of the more subtle structure of text, like the order that words appear in and grammar. Frequency counts of the words used in a document can be used to cluster documents or as features in predictive modelling.
For example, say we have a collection of newspaper articles drawn from entertainment and business categories. The set of documents is known as the corpus. The set of all unique words used in any of the articles constitutes our bag-of-words, also called the vocabulary.
We can represent this data as a matrix, in which each article is a row, each word is a column, and the entry in row i and column j contains the number of times word j appears in document i (many of these values will be zero, so we have a sparse matrix). A final column contains our response variable, indicating the type of article (entertainment or business).
We could use the word frequencies as features to build a model that, on the basis of frequencies of different words, predicts whether an article is about business or entertainment. This is a typical kind of use for a bag-of-words model. Some hypothetical data is shown below.
| document | sing | money | the | happy | dividend | … | response |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 0 | 15 | 0 | 0 | … | ent |
| 2 | 0 | 5 | 12 | 2 | 0 | … | bus |
| 3 | 0 | 0 | 3 | 0 | 6 | … | bus |
| … | … | … | … | … | … | … | … |
| 100 | 10 | 0 | 13 | 10 | 2 | … | ent |
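A matrix like this can be sketched in base R on a hypothetical mini-corpus (the documents and words below are made up for illustration, not real article data):

```r
# Hypothetical mini-corpus of three "documents"
docs <- c("sing the happy song",
          "the money the dividend",
          "money money dividend")
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))
# Document-term matrix: rows are documents, columns are vocabulary words
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- paste0("doc", seq_along(docs))
dtm  # e.g. doc2 uses "the" twice, so dtm["doc2", "the"] is 2
```

In practice the vocabulary is far larger and most entries are zero, which is why sparse-matrix representations are usually used instead.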
We’ll start by loading the packages we need, loading the data containing the tweets, and doing some wrangling. Most of this is the same as what we have been doing for the NLP tutorials.
Specifically, we:
This is mostly the same as what we did in the previous tutorial, so refer back to it if anything is not clear. In addition, we take a sample of 1000 tweets before and after he became president. That will be enough to build a model later.
library(tidyverse)
library(stringr)
library(lubridate)
library(tidytext)
library(rpart)
load('trump-tweets.RData')
# make data a tibble
tweets <- as_tibble(tweets)
# parse the date and add some date related variables
tweets <- tweets %>%
mutate(date = parse_datetime(str_sub(tweets$created_at, 5, 30), "%b %d %H:%M:%S %z %Y")) %>%
mutate(is_potus = (date > ymd(20161108))) %>%
mutate(month = make_date(year(date), month(date)))
# take a random sample of 1000 tweets before and after he became president
set.seed(123)
tweets_sample <- tweets %>% group_by(is_potus) %>% slice_sample(n = 1000) %>% ungroup()
# turn into tidy text
replace_reg <- '(http.*?(\\s|.$))|(www.*?(\\s|.$))|&|<|>'
unnest_reg <- "[^\\w_#@'’]"
tidy_tweets <- tweets_sample %>%
filter(is_retweet == FALSE) %>% #remove retweets
mutate(text = str_replace_all(text, replace_reg, '')) %>% #remove stuff we don't want like links
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>% #tokenize
# filter(!word %in% stop_words$word, str_detect(word, '[A-Za-z]')) %>% #LEAVE THE STOP WORDS IN THIS TIME
filter(!str_detect(word, '@real[Dd]onald[Tt]rump')) %>% #remove @realDonaldTrump, which we know is highly predictive
select(date, word, is_potus, favorite_count, id_str, month) #choose the variables we need
Let’s look at the most popular words:
tidy_tweets %>%
group_by(word) %>%
count() %>%
ungroup() %>%
arrange(desc(n)) %>%
filter(rank(desc(n)) <= 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 the 1694
## 2 to 927
## 3 and 890
## 4 a 747
## 5 of 688
## 6 in 617
## 7 is 614
## 8 for 454
## 9 i 438
## 10 you 413
Now we shall seek to extract bag-of-words data from the text. We put our data into “bag-of-words” form, by:
Here we simplify things by looking only at the frequency of the 200
most popular words. We first find out what the 200 most commonly used
words are. We’re doing this a slightly different way to the previous
block of code, where we first sorted (by the count variable
n, using arrange), and then filtered by rank.
Here we use top_n(), a convenience function that
selects the top or bottom entries in each group, ordered by
wt.
word_bag <- tidy_tweets %>%
group_by(word) %>%
count() %>%
ungroup() %>%
top_n(200, wt = n) %>%
select(-n)
We actually end up with more than 200 words, because the 200 in
top_n() refers to the rank of the last observation that
gets included. All words tied for the last rank (200) get included, and
we end up with more than 200 words. That’s really not a problem here -
because the main goal of the subsetting is just to reduce the full
problem to something more manageable in the tutorial, we don’t really
need to worry if we have exactly 200 words to use as features.
nrow(word_bag)
## [1] 203
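This tie behaviour is easy to demonstrate with a base-R sketch on made-up counts (hypothetical numbers, not the tweet data):

```r
n <- c(9, 7, 7, 5, 5, 5, 3)
# keep everything tied with the 4th-largest count, mimicking top_n(4)
cutoff <- sort(n, decreasing = TRUE)[4]
keep <- n >= cutoff
sum(keep)  # 6 items rather than 4, because of the three-way tie at 5
```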
Now we calculate the number of times each of these words was used in each of the 2000 tweets. Note that because we’re using a subset of words, we might find that some tweets don’t contain any of these words in them. These observations are dropped from the analysis.
tweets_tdf <- tidy_tweets %>%
inner_join(word_bag) %>%
group_by(id_str,word) %>%
count() %>%
group_by(id_str) %>%
mutate(total = sum(n)) %>%
ungroup()
We now reshape the data set to get it into traditional format for
predictive modelling, using pivot_wider(). This means that
each tweet will be in its own row, and each word in its own column. Note
that we’re moving here from a tidy to untidy format, because that is the
format required by rpart, the classification tree
package we’re going to use to do further modelling.
bag_of_words <- tweets_tdf %>%
select(id_str, word, n) %>%
pivot_wider(names_from = word, values_from = n, values_fill = 0) %>%
left_join(tweets %>% select(id_str, is_potus)) %>%
select(id_str, is_potus, everything())
# number of tweets
nrow(bag_of_words)
## [1] 1834
# number of variables (words, plus id and response)
ncol(bag_of_words)
## [1] 205
bag_of_words %>% head(6)
## # A tibble: 6 × 205
## id_str is_po…¹ ameri…² congr…³ love of on people the we you
## <chr> <lgl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1000064605… TRUE 1 1 1 1 1 1 1 1 1
## 2 1000151354… TRUE 0 0 0 1 0 0 0 0 0
## 3 1000174070… TRUE 0 0 0 0 1 0 2 1 0
## 4 1000366478… TRUE 1 0 0 2 0 1 4 0 0
## 5 1000718611… TRUE 0 0 0 0 0 0 2 0 0
## 6 1001099771… TRUE 0 0 0 0 0 0 1 0 0
## # … with 194 more variables: a <int>, and <int>, do <int>, down <int>, i <int>,
## # it <int>, must <int>, no <int>, nothing <int>, obama <int>, s <int>,
## # security <int>, u <int>, with <int>, about <int>, are <int>, be <int>,
## # `if` <int>, `in` <int>, korea <int>, north <int>, that <int>, very <int>,
## # which <int>, will <int>, at <int>, from <int>, good <int>, great <int>,
## # his <int>, house <int>, news <int>, should <int>, this <int>, back <int>,
## # been <int>, by <int>, going <int>, have <int>, help <int>, hunt <int>, …
Each tweet is represented by a row, and each word by a column.
Losing some tweets is actually a bit of a problem. There are two reasons we want the same number of tweets in both “before” and “after” classes of the response. First, many predictive modelling approaches do better with balanced data. Second, it gives us an easy benchmark (50%) to evaluate our results against. So we should check whether the dropped tweets have come disproportionately from one response class and whether the classes are still balanced.
table(bag_of_words$is_potus)
##
## FALSE TRUE
## 975 859
Here we see that whilst only 25 non-presidential tweets were lost by keeping only the 203 most frequent words, 141 presidential tweets were removed. To balance the classes again, we will randomly select tweets from the majority class equal in number to the minority class.
min_class_size <- min(table(bag_of_words$is_potus))
bag_of_words <- bag_of_words %>% group_by(is_potus) %>% sample_n(min_class_size) %>% ungroup()
table(bag_of_words$is_potus)
##
## FALSE TRUE
## 859 859
Here, we shall start building our bag-of-words classifier model.
We’d like to know whether we can use these word frequencies to predict whether a Trump tweet was made while he was president or not.
We have a binary response, and continuous (integer) predictors, so we can use a variety of approaches to model this problem. Here we’ll use a binary classification tree constructed using the CART algorithm, implemented in the rpart package.
Before building the tree, we split our data into training and test sets. I’ve included 70% of the data in the training set and left the rest for testing.
set.seed(321)
training_ids <- bag_of_words %>%
group_by(is_potus) %>%
slice_sample(prop = 0.7) %>%
ungroup() %>%
select(id_str)
training_tweets <- bag_of_words %>%
right_join(training_ids, by = 'id_str') %>%
select(-id_str)
test_tweets <- bag_of_words %>%
anti_join(training_ids, by = 'id_str') %>%
select(-id_str)
We then fit a tree to the training data.
fit <- rpart(is_potus ~ ., training_tweets, method = 'class')
And plot the full tree.
# options(repr.plot.width = 12, repr.plot.height = 10) # set plot size in the notebook
plot(fit, main = 'Full Classification Tree')
text(fit, use.n = TRUE, all = TRUE, cex=.8)
The first split is on the word “and”. This is the key idea here: we are interested in the words themselves, i.e., which key words the model uses to decide whether a tweet was made pre- or post-presidency.
We can assess the accuracy in the training data set by extracting a cross-table of predicted against observed classifications, and calculating the percentage of classifications the model got correct. Because we have balanced class sizes (the same number of tweets before and after presidency), a completely random classifier returns, on average, 50% accuracy.
fittedtrain <- predict(fit, type = 'class')
predtrain <- table(training_tweets$is_potus, fittedtrain)
predtrain
## fittedtrain
## FALSE TRUE
## FALSE 466 135
## TRUE 195 406
paste("Training Accuracy is:", round(sum(diag(predtrain))/sum(predtrain), 3)) # training accuracy
## [1] "Training Accuracy is: 0.725"
We can do the same in the test dataset - this is the accuracy we would be most interested in.
fittedtest <- predict(fit, newdata = test_tweets, type = 'class')
predtest <- table(test_tweets$is_potus, fittedtest)
predtest
## fittedtest
## FALSE TRUE
## FALSE 192 66
## TRUE 95 163
paste("Testing Accuracy is:", round(sum(diag(predtest))/sum(predtest), 3)) # test accuracy
## [1] "Testing Accuracy is: 0.688"
Ultimately this is the procedure for a bag-of-words model. Essentially it uses raw term frequency as an indicator of how important each word is in each document (tweet).
We now introduce the TF-IDF concept. Specifically, in this section we briefly discuss the maths and background of TF-IDF: a way of weighting word frequencies that adjusts for how common a word is across documents (and hence how “special” a word is to a particular document). We then:
What we have seen so far, particularly in the previous tutorial on sentiment analysis and text mining, focused on identifying the frequency of individual terms within a document, along with the sentiments those words carry. It is also important to understand the importance of words within and across documents.
So consider our previous example, Donald Trump’s tweets. We used the words contained in tweets, and specifically the frequencies of those words, to discriminate between tweets made while he was president and those made before. This is standard practice in text mining: we describe a document by the words it contains. Now, how can we decide which words are “important” to a document, by which we mean words that distinguish that document from another?
We have two approaches:
use word frequency. If a word like “apple” appears frequently in a document, the document is more likely to be about a recipe or health than about politics, for example. However, some words that appear frequently are not particularly useful: words like “the”, “a”, and so on. These may appear frequently in a particular document, but are not interesting to us because they also appear frequently in many other documents. We’ve previously encountered these as stop words, and we’ve typically removed them from the analysis to reveal more interesting underlying patterns. This approach relies on term frequency (tf), which measures how frequently a word occurs in a document.
rather than removing the stop words, downweight them. This is what tf-idf (term frequency-inverse document frequency) does. Here “term” is just a general indicator of the unit of text we are interested in, like “token”. Essentially, tf-idf decreases the weight of commonly used words and increases the weight of words that are not used very much in a collection of documents.
Tf-idf is composed of two parts:
The tf part: A term’s frequency increases with how many times it is used in that document. For tf-idf we use relative frequencies: the number of times a word appears, divided by the total number of words in that document.
The idf part: A term’s inverse document frequency is a measure of how many documents (in the corpus) contain that term. It decreases the weight of commonly used words (which all or most documents will use) relative to words that are not used by many documents in the corpus.
Specifically, we are going to replace the term frequencies we were using before with new values called “inverse document frequency weighted term frequencies” or tf-idf for short. So each term in each document will get its own tf-idf “score”, which we denote \(tfidf(\text{term t in document i})\).
Then:
\[tfidf(\text{term t in document i}) = tf(\text{term t in document i}) \times idf(\text{term t})\]
where
\[tf(\text{term t in document i}) = \displaystyle\frac{\text{Number of times term t appears in document i}}{\text{Number of terms in document i}}\]
and
\[idf(\text{term t}) = \ln\biggl(\displaystyle\frac{\text{Number of documents in corpus}}{\text{Number of documents containing term t}}\biggr)\]
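As a quick worked example of these formulas (all numbers made up): suppose a corpus of 100 documents, a term t appearing in 4 of them, and 3 occurrences of t in a document of 60 terms.

```r
n_docs       <- 100  # documents in the corpus
docs_with_t  <- 4    # documents containing term t
count_in_doc <- 3    # times t appears in document i
terms_in_doc <- 60   # total terms in document i

tf  <- count_in_doc / terms_in_doc  # 0.05
idf <- log(n_docs / docs_with_t)    # ln(25), about 3.22
tf * idf                            # about 0.161
```

Had t appeared in all 100 documents instead, idf would be ln(1) = 0 and the tf-idf score would vanish, no matter how often t appeared in the document.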
The log function increases in its argument, and the number of documents in the corpus is fixed, so as the number of documents containing a term increases (the denominator above), the idf part decreases. The log transform ensures that this decrease is steep initially, and then levels off. Below I show how the idf weight changes with the number of documents (out of a total of 100 documents) that contain a term.
# options(repr.plot.width = 8, repr.plot.height = 8) # set plot size in the notebook
plot(1:100, log(100/1:100), type = 'l',
xlab = 'Number of documents (out of 100) containing term t', ylab = 'idf(t)')
So suppose we have 100 documents, and a word that appears in only one of them: its idf score, ln(100) ≈ 4.6, is very high. With only 10 documents, the corresponding score, ln(10) ≈ 2.3, is half that.
As a bonus, we no longer have to carry stop word dictionaries around with us: using idf allows us to include all words and lets idf downweight the most common ones.
Let’s do this now in R. Below we carry out the calculations above to work out tf-idf values for the words (terms) in our tweets (documents).
ndocs <- length(unique(tweets_tdf$id_str))
idf <- tweets_tdf %>%
group_by(word) %>%
summarize(docs_with_word = n()) %>%
ungroup() %>%
mutate(idf = log(ndocs / docs_with_word)) %>% arrange(desc(idf))
tweets_tdf <- tweets_tdf %>%
left_join(idf, by = 'word') %>%
mutate(tf = n/total, tf_idf = tf * idf)
Let’s explore how the weighting affects the terms that come up as “most important”, by looking at one particular tweet.
set.seed(321)
random_tweet <- sample(tweets_tdf$id_str, 1)
tweets %>% filter(id_str == random_tweet) %>% select(text)
## # A tibble: 1 × 1
## text
## <chr>
## 1 Heading to Alabama now, big crowd!
Below we rank words in descending order of importance by the
criterion of word frequency (n). By changing this to
inverse document frequency (idf) and tf-idf
(tf_idf) you can see which words become more or less
important, and get a sense for why.
tweets_tdf %>% filter(id_str == random_tweet) %>% arrange(desc(n))
## # A tibble: 3 × 8
## id_str word n total docs_with_word idf tf tf_idf
## <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 911338971250216960 big 1 3 85 3.07 0.333 1.02
## 2 911338971250216960 now 1 3 111 2.80 0.333 0.935
## 3 911338971250216960 to 1 3 681 0.991 0.333 0.330
We’ve done the tf-idf calculation “from scratch” to get a better
understanding of what is happening. The tidytext
package has a function bind_tf_idf() that does the same
thing.
tweets_tdf <- tweets_tdf %>%
select(-idf, -tf, -tf_idf) %>% #remove the old ones we worked out
bind_tf_idf(word, id_str, n) #replace with values from tidytext
tweets_tdf %>% filter(id_str == random_tweet) %>% arrange(desc(n)) # check same as above
## # A tibble: 3 × 8
## id_str word n total docs_with_word tf idf tf_idf
## <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 911338971250216960 big 1 3 85 0.333 3.07 1.02
## 2 911338971250216960 now 1 3 111 0.333 2.80 0.935
## 3 911338971250216960 to 1 3 681 0.333 0.991 0.330
Tf-idf features often give better accuracy in predictive modelling than using word frequencies. In this section we repeat the earlier analysis, building a classification tree to predict whether a tweet was made before or after becoming president, but replacing the word frequency features with the tf-idf values calculated above.
We start by reshaping the data set:
tfidf <- tweets_tdf %>%
select(id_str, word, tf_idf) %>% # note the change, using tf-idf
pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0) %>%
left_join(tweets %>% select(id_str,is_potus))
We use the same training and test sets as before.
training_tweets <- tfidf %>%
right_join(training_ids, by = 'id_str') %>%
select(-id_str)
test_tweets <- tfidf %>%
anti_join(training_ids, by = 'id_str') %>%
select(-id_str)
We fit a tree to training data:
fit <- rpart(factor(is_potus) ~ ., training_tweets)
Plot the tree we just created:
# options(repr.plot.width = 12, repr.plot.height = 10) # set plot size in the notebook
plot(fit, main='Full Classification Tree')
text(fit, use.n=TRUE, all=TRUE, cex=.8)
And check the accuracy in training and test data sets:
fittedtrain <- predict(fit, type = 'class')
predtrain <- table(training_tweets$is_potus, fittedtrain)
predtrain
## fittedtrain
## FALSE TRUE
## FALSE 479 122
## TRUE 223 378
paste("Training Accuracy is:", round(sum(diag(predtrain))/sum(predtrain), 3)) # training accuracy
## [1] "Training Accuracy is: 0.713"
fittedtest <- predict(fit,newdata=test_tweets,type='class')
predtest <- table(test_tweets$is_potus,fittedtest)
predtest
## fittedtest
## FALSE TRUE
## FALSE 298 76
## TRUE 112 146
paste("Testing Accuracy is:", round(sum(diag(predtest))/sum(predtest), 3)) # test accuracy
## [1] "Testing Accuracy is: 0.703"
We get a very slight improvement in accuracy from replacing word frequency features with ones based on tf-idf.
This example leverages the data provided in the
harrypotter package. This package provides access to the
full texts of the first seven Harry Potter books.
We load the harrypotter package.
# load package
if (!requireNamespace("devtools", quietly = TRUE) ||
    packageVersion("devtools") < "1.6") {
  install.packages("devtools")
}
devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
The seven novels we are working with, provided by the harrypotter package, are:

- philosophers_stone: Harry Potter and the Philosophers Stone (1997)
- chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
- prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
- goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
- order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
- half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
- deathly_hallows: Harry Potter and the Deathly Hallows (2007)
We could have a look at the raw text of the first chapter of
philosophers_stone, however the output (as expected) is
quite lengthy. If you want, you can execute the code yourself. We look
at the data structure below. We see that Harry Potter and the
Philosophers Stone has 17 chapters.
str(philosophers_stone)
## chr [1:17] "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfe"| __truncated__ ...
Now, with our data, we need to do what we have always been doing: turning it into tidy format. We need to do this in order to compute term frequencies. The following converts all seven Harry Potter novels into a tibble that has each word, by chapter, by book.
# Convert to Tidy
titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
"Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
"Deathly Hallows")
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
goblet_of_fire, order_of_the_phoenix, half_blood_prince,
deathly_hallows)
series <- tibble()
for(i in seq_along(titles)) {
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
unnest_tokens(word, text) %>%
mutate(book = titles[i]) %>%
select(book, everything())
series <- rbind(series, clean)
}
# set factor to keep books in order of publication
series$book <- factor(series$book, levels = rev(titles))
series
## # A tibble: 1,089,386 × 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
## 7 Philosopher's Stone 1 mrs
## 8 Philosopher's Stone 1 dursley
## 9 Philosopher's Stone 1 of
## 10 Philosopher's Stone 1 number
## # … with 1,089,376 more rows
From this cleaned-up text we can compute the term frequency for each word. Let’s do this both by book and across the entire Harry Potter series.
book_words <- series %>%
count(book, word, sort = TRUE) %>%
ungroup()
series_words <- book_words %>%
group_by(book) %>%
summarise(total = sum(n))
book_words <- left_join(book_words, series_words)
book_words
## # A tibble: 67,881 × 4
## book word n total
## <fct> <chr> <int> <int>
## 1 Order of the Phoenix the 11740 258763
## 2 Deathly Hallows the 10335 198906
## 3 Goblet of Fire the 9305 191882
## 4 Half-Blood Prince the 7508 171284
## 5 Order of the Phoenix to 6518 258763
## 6 Order of the Phoenix and 6189 258763
## 7 Deathly Hallows and 5510 198906
## 8 Order of the Phoenix of 5332 258763
## 9 Prisoner of Azkaban the 4990 105275
## 10 Goblet of Fire and 4959 191882
## # … with 67,871 more rows
Here, we can see that common and noncontextual words (“the”, “to”, “and”, “of”, etc.) rule the term frequency list. We can visualize the distribution of term frequency for each novel. Here we’ll look at the distribution of n/total.
book_words %>%
mutate(ratio = n / total) %>%
ggplot(aes(ratio, fill = book)) +
geom_histogram(show.legend = FALSE) +
scale_x_log10() +
facet_wrap(~ book, ncol = 2)
Since the distribution is so clustered around 0, we add scale_x_log10() to spread it out. Even so, we see the long right tails for those extremely common words.
Zipf’s law states that within a group or corpus of documents, the frequency of any word is inversely proportional to its rank in a frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(term frequency).
freq_by_rank <- book_words %>%
group_by(book) %>%
mutate(rank = row_number(),
`term freq` = n / total)
ggplot(freq_by_rank, aes(rank, `term freq`, color = book)) +
geom_line() +
scale_x_log10() +
scale_y_log10()
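A common follow-up is to estimate the Zipf exponent by regressing log frequency on log rank. Here is a base-R sketch using a hypothetical frequency vector (not the freq_by_rank data above), chosen to roughly follow a 1/rank pattern:

```r
freq <- c(1000, 480, 330, 260, 195, 170, 140, 125, 110, 100)
rank <- seq_along(freq)
fit  <- lm(log10(freq) ~ log10(rank))
coef(fit)[2]  # slope; Zipf's law predicts a value near -1
```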
Our plot illustrates that the distribution is similar across the seven
books. Let us now look at applying tf-idf.
The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the Harry Potter series.
Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Or put another way, tf-idf helps to find the important words that can provide specific document context.
We can easily compute the idf and tf-idf using the
bind_tf_idf function provided by the tidytext
package.
book_words <- book_words %>%
bind_tf_idf(word, book, n)
book_words
## # A tibble: 67,881 × 7
## book word n total tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Order of the Phoenix the 11740 258763 0.0454 0 0
## 2 Deathly Hallows the 10335 198906 0.0520 0 0
## 3 Goblet of Fire the 9305 191882 0.0485 0 0
## 4 Half-Blood Prince the 7508 171284 0.0438 0 0
## 5 Order of the Phoenix to 6518 258763 0.0252 0 0
## 6 Order of the Phoenix and 6189 258763 0.0239 0 0
## 7 Deathly Hallows and 5510 198906 0.0277 0 0
## 8 Order of the Phoenix of 5332 258763 0.0206 0 0
## 9 Prisoner of Azkaban the 4990 105275 0.0474 0 0
## 10 Goblet of Fire and 4959 191882 0.0258 0 0
## # … with 67,871 more rows
Notice how the common non-contextual words (“the”, “to”, “and”, etc.) have high tf values but idf and tf-idf values of 0. This is because a term that occurs in every document has an idf of log(1) = 0. We can look at the words that have the highest tf-idf values.
Here we see mainly names for characters in each book that are unique to that book, and therefore used often, but are absent or nearly absent in the other books.
book_words %>%
arrange(desc(tf_idf))
## # A tibble: 67,881 × 7
## book word n total tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Half-Blood Prince slughorn 335 171284 0.00196 1.25 0.00245
## 2 Deathly Hallows c 1300 198906 0.00654 0.336 0.00220
## 3 Order of the Phoenix umbridge 496 258763 0.00192 0.847 0.00162
## 4 Goblet of Fire bagman 208 191882 0.00108 1.25 0.00136
## 5 Chamber of Secrets lockhart 197 85401 0.00231 0.560 0.00129
## 6 Prisoner of Azkaban lupin 369 105275 0.00351 0.336 0.00118
## 7 Goblet of Fire winky 145 191882 0.000756 1.25 0.000947
## 8 Goblet of Fire champions 84 191882 0.000438 1.95 0.000852
## 9 Deathly Hallows xenophilius 79 198906 0.000397 1.95 0.000773
## 10 Half-Blood Prince mclaggen 65 171284 0.000379 1.95 0.000738
## # … with 67,871 more rows
To understand the most common contextual words in each book we can take a look at the top 15 terms with the highest tf-idf.
book_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))),
book = factor(book, levels = titles)) %>%
group_by(book) %>%
top_n(15, wt = tf_idf) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = book)) +
geom_bar(stat = "identity", alpha = .8, show.legend = FALSE) +
labs(title = "Highest tf-idf words in the Harry Potter series",
x = NULL, y = "tf-idf") +
facet_wrap(~book, ncol = 2, scales = "free") +
coord_flip()
As you can see, most of these high-ranking tf-idf words are nouns that provide specific context around the most common characters in each individual book.
The examples in Parts 1 and 2 illustrated how bag-of-words models, downweighted using tf-idf, allow us to vectorize documents containing tokens in such a way that we can use predictive modelling, in a supervised learning sense, to classify documents into one of any number of classes, based on the contents of each document.
The TF-IDF statistic computes the frequency of a term adjusted for how rarely it is used. Since the ratio inside the idf’s log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0. So, it is essentially a measure of the ‘importance’ of a particular word for a given document, and it’s often used in combination with other analyses, for example as an input to a text classifier, or to help rank search results.
Topic modelling starts by asking what a document is about. The idea is to work out what a document is all about, and to group together documents that share similar themes, based on their contents.
This notebook covers topic modelling and, in particular, one method used to do topic modelling: latent Dirichlet allocation.
Specifically, we shall:

- Look at how a topic model attempts to “summarize” the document-term matrix
- Describe latent Dirichlet allocation (LDA), a popular method for topic modelling that avoids some of the pitfalls of other methods
- Show how you can do topic modelling with LDA using the topicmodels package
Note that during this tutorial we shall pave the way by introducing a number of methods for doing topic modelling, and also some fundamental modelling “building blocks” like maximum likelihood. We finish off by creating a topic model for a collection of movie reviews.
Topic modelling is a type of bag-of-words model that aims to summarize a document by describing it in terms of a small number of “topics”. The resulting descriptions can then be used for several purposes, for example, to judge similarity between documents and to cluster together similar documents, or for document classification.
As we’ve seen, in the bag-of-words model a document is represented by:
When a number of documents are collected into a corpus, we start by identifying all the words that are used by the documents in the corpus – the “vocabulary” of the corpus. Each document can then be thought of as a (often very) long vector of frequencies (as long as the vocabulary), many of which will be zero. These vectors, stacked upon each other, one per document, give us the document-term matrix.
The “wideness” of the document-term matrix poses a challenge to efficient further analysis, such as clustering and classification. So, we might ask whether we can reduce the dimensionality of the problem by summarizing each document in some way that preserves most of the information contained in the vector of word frequencies, but that makes subsequent analysis easier and more efficient.
This is what topic modelling does: in a nutshell, it summarizes documents by describing them in terms of a smaller number of topics. Over and above improving the efficiency with which further analyses can be done, the resulting summaries contain useful information about what a document is “about”.
The idea of maximum likelihood estimation (MLE) is central to topic modelling. We observe the frequency distribution of words in each document, and we try to maximize the likelihood of seeing that particular set of words by adjusting the topic probabilities. The topics are what we are interested in, but we don’t know their probabilities, so we tweak them until the observed set of words is as likely as possible.
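As a small, self-contained illustration of the maximum likelihood idea (a toy example with made-up counts, not the LDA machinery itself): suppose we observed the word "taxi" 30 times out of 100 tokens, and we want the probability p that makes this observation most likely.

```r
# Negative log-likelihood of observing 30 "successes" in 100 tokens,
# as a function of the unknown probability p
negloglik <- function(p) -dbinom(30, size = 100, prob = p, log = TRUE)

# Minimizing the negative log-likelihood = maximizing the likelihood
fit <- optimize(negloglik, interval = c(0.001, 0.999))
fit$minimum   # very close to 30/100 = 0.3, the maximum likelihood estimate
```

LDA does the same kind of thing at a much larger scale: it tweaks the topic-word and document-topic probabilities until the observed words are as likely as possible.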
We now move away from the Trump data set, containing Donald Trump’s tweets, and move on to a data set of movie reviews.
In this example we construct a topic model on reviews for two different movies, Taxi Driver and Waterworld. Since we know that the main “topics” should be to do with the two movies, we can easily assess whether LDA is doing a good job at separating out topics, and when it fails.
Using web scraping, we have collected 160 reviews of Taxi Driver, and 160 reviews of Waterworld from the IMDb site. Here you can read the synopses for Taxi Driver and Waterworld. Clearly they are quite different types of movies, although they have some things in common, like being about a loner who protects a young girl from harm.
We start by loading required packages as well as the data.
library(tidyverse)
library(tidytext)
library(topicmodels)
load('my_imdb_reviews.RData')
We then do a bit of data cleaning to make subsequent analyses easier.
We change the class of the review variable from factor to
character and add a document ID variable.
reviews <- as_tibble(reviews)
reviews$review <- as.character(reviews$review)
reviews$reviewId <- 1:nrow(reviews)
We then tokenize the reviews into individual words and remove stop words from the data frame.
tidy_reviews <- reviews %>%
unnest_tokens(word, review, token = 'words', to_lower = T) %>%
filter(!word %in% stop_words$word)
We can now count the number of times each word in our vocabulary was used in each review, creating a “long” (and tidy) format of the document-term matrix.
reviews_tdf <- tidy_reviews %>%
group_by(reviewId,word) %>%
count() %>%
ungroup()
Previously we have created our own “wide” versions of the
document-term matrix using the pivot_wider() function. We
did this, for example, when building a bag-of-words model in the Lecture
6.1 notebook. This was fine when building features for a classification tree (using rpart), but some R packages require a particular implementation of the document-term matrix. One common
implementation is the DocumentTermMatrix class provided by
the tm package.
The topicmodels package, which we use to implement
the LDA topic model, requires the document-term matrix to be provided as
a DocumentTermMatrix object. Fortunately,
tidytext provides a number of functions for turning
“ordinary” document-term matrices into DocumentTermMatrix
(or other kinds of) objects, and also for going the other way (tidying
document-term matrices). This is described in more detail here.
Essentially, we create the DocumentTermMatrix object using
cast_dtm() as shown below.
dtm_reviews <- reviews_tdf %>%
cast_dtm(reviewId, word, n)
We can now estimate the parameters of the topic model using LDA. We
use the LDA() function provided by the package
topicmodels. We need to specify the number of latent
variables (topics) we wish to use.
Notice that the object that is returned by LDA() below
has a specific class, the “LDA_VEM” class. This is a so-called “S4”
object. R provides a number of different systems for object oriented
programming: the main ones are an older, more flexible “S3” object
class, and a newer, stricter “S4” class. We don’t need to worry about
these intricacies, but one important practical difference is that the
$ operator (the way you reference variables within an
object e.g. tweets$text) is not defined for the S4 class.
Instead, you use the @ operator.
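A minimal toy example of this difference (the class name Tiny is made up purely for illustration):

```r
library(methods)

# Define a tiny S4 class with one numeric slot
setClass("Tiny", slots = c(x = "numeric"))
obj <- new("Tiny", x = c(1, 2, 3))

obj@x    # S4 slots are accessed with @
# obj$x  # error: $ is not defined for an S4 object like this one
```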
reviews_lda <- LDA(dtm_reviews, k = 2, control = list(seed = 1234))
reviews_lda
## A LDA_VEM topic model with 2 topics.
str(reviews_lda)
## Formal class 'LDA_VEM' [package "topicmodels"] with 14 slots
## ..@ alpha : num 0.0317
## ..@ call : language LDA(x = dtm_reviews, k = 2, control = list(seed = 1234))
## ..@ Dim : int [1:2] 320 7992
## ..@ control :Formal class 'LDA_VEMcontrol' [package "topicmodels"] with 13 slots
## .. .. ..@ estimate.alpha: logi TRUE
## .. .. ..@ alpha : num 25
## .. .. ..@ seed : int 1234
## .. .. ..@ verbose : int 0
## .. .. ..@ prefix : chr "/var/folders/y5/7wlh6jzj03185p7x7ggqb_280000gn/T//RtmpY2SYgl/filefdac7a12e0cf"
## .. .. ..@ save : int 0
## .. .. ..@ nstart : int 1
## .. .. ..@ best : logi TRUE
## .. .. ..@ keep : int 0
## .. .. ..@ estimate.beta : logi TRUE
## .. .. ..@ var :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
## .. .. .. .. ..@ iter.max: int 500
## .. .. .. .. ..@ tol : num 1e-06
## .. .. ..@ em :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
## .. .. .. .. ..@ iter.max: int 1000
## .. .. .. .. ..@ tol : num 1e-04
## .. .. ..@ initialize : chr "random"
## ..@ k : int 2
## ..@ terms : chr [1:7992] "actors" "affection" "albert" "alienated" ...
## ..@ documents : chr [1:320] "1" "2" "3" "4" ...
## ..@ beta : num [1:2, 1:7992] -7.07 -7.16 -9.97 -9.72 -7.03 ...
## ..@ gamma : num [1:320, 1:2] 1 1 1 1 1 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:30858] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ j : int [1:30858] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ v : num [1:30858] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ nrow: int 320
## .. ..$ ncol: int 7992
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num [1:320] -1328 -959 -2045 -571 -1742 ...
## ..@ iter : int 27
## ..@ logLiks : num(0)
## ..@ n : int 37987
The main things we are interested in are:
beta: these parameters control the probability of a given topic k generating a particular word i.
gamma: this gives the topic “mixtures” for each document.
We’ll first look at the beta parameters and examine which words are most associated with each topic. This will be a good way of checking, in our case, whether the words align into the two topics we are expecting (i.e. one to do with Taxi Driver, the other with Waterworld). More generally, you would do this to assess what the topics are about.
We start by creating a data frame containing the “terms” (the words we tokenized the reviews into), and the beta parameters (one beta parameter per-word-per-topic).
term <- as.character(reviews_lda@terms)
topic1 <- reviews_lda@beta[1,]
topic2 <- reviews_lda@beta[2,]
reviews_topics <- tibble(term = term, topic1 = topic1, topic2 = topic2)
The data frame above is not yet in “tidy” format, because the beta
parameters are spread across two columns. We use the
pivot_longer() function to get the data into a tidy format.
At the same time, we transform the beta parameters into probabilities by
exponentiating each beta.
reviews_topics <- reviews_topics %>%
pivot_longer(c(topic1, topic2), names_to = 'topic', values_to = 'beta') %>%
mutate(beta = exp(beta)) # pr(topic k generates word i) = exp(beta_ik)
head(reviews_topics)
## # A tibble: 6 × 3
## term topic beta
## <chr> <chr> <dbl>
## 1 actors topic1 0.000848
## 2 actors topic2 0.000775
## 3 affection topic1 0.0000468
## 4 affection topic2 0.0000602
## 5 albert topic1 0.000889
## 6 albert topic2 0.0000607
So if a word is generated from topic1, the probability that it is “actors” is the value indicated. Likewise, under topic2, the probability of seeing the word “actors” is the value shown above in the tibble.
The tidytext package provides a function
tidy() that tidies a number of different object types,
including “LDA_VEM” objects. So you can skip the previous step and use a
direct call to tidy() the reviews_lda object
we created earlier. You can use the same approach with
matrix = "gamma" to extract those parameters.
reviews_topics <- tidy(reviews_lda, matrix = 'beta')
head(reviews_topics)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 actors 0.000848
## 2 2 actors 0.000775
## 3 1 affection 0.0000468
## 4 2 affection 0.0000602
## 5 1 albert 0.000889
## 6 2 albert 0.0000607
Finally, we can extract the top 20 terms used in each topic and
arrange these in a tidy format, and plot with a call to
ggplot().
top_terms <- reviews_topics %>%
group_by(topic) %>%
top_n(20, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = 'free') +
coord_flip()
This visualization shows the most frequent words used by each topic, and this gives us a way of understanding what the topics are about. The words used by Topic 2 are mostly to do with Waterworld (the movie title, actors’ names, some plot elements like “mariner”, “land”, etc). The words used by Topic 1 are mostly about Taxi Driver. So, the topic model is clearly picking up that the reviews are largely about two “topics”, and that these topics relate to the movies being reviewed. Also note that some words (like “movie” and “film”) are common to both topics. This is a benefit of LDA: words can be associated with more than one topic.
A different way of drawing out the differences between topics is to look at which words have the greatest difference in beta values between Topic 1 and Topic 2. This could be done in several ways but a good default is to take the advice in TMR and use the log (base 2) ratio of the two betas: \(log_2(\beta_2/\beta_1)\). Any log ratio is symmetrical: the log ratio for \(\beta_2/\beta_1\) is -1 times the log ratio for \(\beta_1/\beta_2\). A base of 2 in the log ratio makes things a bit easier to interpret. The log (base 2) ratio increases by 1 for every doubling of \(\beta_2\) (relative to \(\beta_1\)).
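A quick numeric illustration of this interpretation (toy beta values, not taken from the model):

```r
# If beta_2 = 0.004 and beta_1 = 0.001, beta_2 is four times beta_1:
# two doublings, so the log2 ratio is 2
log2(0.004 / 0.001)   # 2

# Swapping the two betas just flips the sign, reflecting the symmetry
log2(0.001 / 0.004)   # -2
```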
We calculate the log ratios below and plot the words with the biggest positive and negative log ratios. We first filter to relatively common words: those with a beta greater than 1/1000 in at least one topic.
beta_spread <- reviews_topics %>%
mutate(topic = paste0('topic', topic)) %>%
pivot_wider(names_from = topic, values_from = beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_spread %>%
group_by(direction = log_ratio > 0) %>%
top_n(10, abs(log_ratio)) %>%
ungroup() %>%
mutate(term = reorder(term, log_ratio)) %>%
ggplot(aes(term, log_ratio)) +
geom_col() +
labs(y = 'Log2 ratio of beta in topic 2 / topic 1') +
coord_flip()
The word “dryland” is a very strong indicator of Topic 2. The word “atoll” also has something to do with water, so it makes sense as a strong indicator (an atoll is a ring-shaped coral reef, island, or series of islets). The words at the bottom of the plot are strong indicators of Topic 1 rather than Topic 2.
LDA models each document as a mixture of topics. These mixtures can be used in a straightforward way to assess whether a particular document is “mostly about Topic 1” or “mostly about Topic 2”. In our case, since we have a very clear idea both of what the topics are about (i.e. how topics map onto movies) and what the reviews were supposed to be about we can examine these mixtures and see whether they make sense.
The “mixture” of topics in each document is given by document-topic probabilities referred to collectively as “gamma”. Below we extract these from the object containing the LDA results, and merge the gamma values back into the original data frame containing the reviews.
reviews_gamma <- reviews %>%
left_join(tidy(reviews_lda, matrix = 'gamma') %>%
mutate(reviewId = as.numeric(document)) %>% #some cleaning to make key variable (reviewId) usable
select(-document) %>%
spread(key = topic, value = gamma, sep = '_'))
We know that Topic 1 is clearly about Taxi Driver and Topic 2 is about Waterworld. We can therefore ask: how many of the reviews for each movie were classified as being mostly about “their” topic?
We answer this question below (Taxi Driver has
imdbID = 0075314). Six of the 160 Taxi Driver
reviews were estimated to mostly be about Topic 2, while three of the
160 Waterworld reviews were mostly about Topic 1, according to
LDA.
reviews_gamma %>% group_by(imdbId) %>% summarize(ntopic1 = sum(topic_1 > 0.5))
## # A tibble: 2 × 2
## imdbId ntopic1
## <fct> <int>
## 1 0075314 154
## 2 0114898 3
154 out of 160 is still pretty good. Let’s have a look at a few cases where the topic model got it wrong. First, we can look at a few reviews for Taxi Driver that the model said were mostly about Topic 2 (the other topic, Waterworld).
reviews_gamma %>% filter(imdbId == '0075314') %>% arrange(topic_1) %>% head(7) %>% select(review, topic_1, topic_2)
## # A tibble: 7 × 3
## review topic_1 topic_2
## <chr> <dbl> <dbl>
## 1 " a nicely made dark and majestic experience see the world t… 0.376 0.624
## 2 " to those who can t figure out the base of this movie you n… 0.388 0.612
## 3 " i had some hopes for this movie since i found it among the … 0.419 0.581
## 4 " when diving into this movie you may think this is about liv… 0.444 0.556
## 5 " a truly disturbing movie travis bickle robert deniro gr… 0.444 0.556
## 6 " my favourite movie number one such a film is once seen in… 0.445 0.555
## 7 " the sicilian connection taxi driver and the winter s tale w… 0.519 0.481
Here we see a Taxi Driver review that, even though it contains words like “taxi” and “driver” quite often, was given a 62% probability of being from Topic 2 by our model.
Compare these to the Taxi Driver reviews that were “most” about Topic 1 (which was, remember, to do with Taxi Driver).
reviews_gamma %>% filter(imdbId == '0075314') %>% arrange(desc(topic_1)) %>% head(3) %>% select(review, topic_1, topic_2)
## # A tibble: 3 × 3
## review topic_1 topic_2
## <chr> <dbl> <dbl>
## 1 " travis bickle is the sort of person you wouldn t even see … 1.00 6.56e-5
## 2 " watching taxi driver made me realize everything i thought i… 1.00 8.42e-5
## 3 " i watched taxi driver 1976 a couple of years ago though… 1.00 8.72e-5
We notice that these are cases where people write very long, drawn-out reviews: the more words a review contains, the more evidence the model has, and the more confident its topic assignment becomes.
Let’s do the same thing for Waterworld: look at a few reviews that the model said were mostly about Topic 1 (Taxi Driver).
reviews_gamma %>% filter(imdbId != '0075314') %>% arrange(topic_2) %>% head(4) %>% select(review, topic_1, topic_2)
## # A tibble: 4 × 3
## review topic_1 topic_2
## <chr> <dbl> <dbl>
## 1 " best and most efficient opening sequence in film ever pe… 0.584 0.416
## 2 " i can find something to like in almost any film what s gr… 0.545 0.455
## 3 " this film is mad max combined with planet of the apes and… 0.543 0.457
## 4 " this can go with the chronicles of riddick mad max termi… 0.485 0.515
And finally, compare the reviews above to reviews that were “most” about Topic 2.
reviews_gamma %>% filter(imdbId != '0075314') %>% arrange(desc(topic_2)) %>% head(3) %>% select(review, topic_1, topic_2)
## # A tibble: 3 × 3
## review topic_1 topic_2
## <chr> <dbl> <dbl>
## 1 " out of sort of like a mad max meets indiana jones … 7.04e-5 1.00
## 2 " it takes gumption to open a movie with the sight of kevin c… 7.99e-5 1.00
## 3 " i ll admit it i liked waterworld or parts of it … 9.12e-5 1.00
We see similar results here: when the reviewer writes a lot, the model assigns the topic with near certainty.
This was just an illustration of how Latent Dirichlet Allocation can be used to divide a corpus of documents into different topics.
For more examples and teaching, you can work through the chapter on Topic Modelling (Chapter 6) of Text Mining with R. There is also a topic modelling part to the case study in Chapter 9.