Abstract
A tutorial introducing text mining methods.
This tutorial provides a foundational introduction to text mining and, more generally, natural language processing. We cover four key topics: text mining basics, sentiment analysis, bag of words, and TF-IDF. The series looks at a selection of methods for examining text data, manipulating it into different forms, and extracting insight from it. In this tutorial we shall mainly be using the tidytext package in R. A summary of each part is given below:
PART 1: outlines the tidy text format and the
unnest_tokens() function. We also introduce
\(n\)-grams. PART 2: shows how to
perform sentiment analysis on a tidy text data set.
The work in this tutorial is largely based on the work of Ian Durbach; his profile is found here. In fact, much of it follows the Data Science for Industry course from the University of Cape Town. Other great resources used for this tutorial are discussed below.
One of the resources is Chapter 12 of R4DS, which covers tidy data and the tidyr package.
We shall also be using Text Mining with R by Julia Silge and David Robinson. I highly recommend this book, as their approach is to transform the text into a tidy format that allows you to easily analyse and visualize the results using graphs. This particular tutorial follows chapters 1 to 4 of the book.
Finally, we shall also be making use of Chapter 20 of Data Science from Scratch, which covers natural language processing; the text generation section borrows from that chapter, though our use of this resource here is limited.
The document is for the most part very applied in nature, and doesn’t
assume much beyond familiarity with the R statistical computing
environment. For programming purposes, it would be useful if you are
familiar with the tidyverse, or at least dplyr
specifically.
It must be stressed that this is only a starting point, a hopefully fun foray into the world of text, not a definitive statement of how you should analyze text. In fact, some of the methods demonstrated would likely be too rudimentary for most goals.
In the first part of this notebook, we shall cover:
tidy data principles (using the pivot_longer() and pivot_wider() functions from the tidyr package)
tidying text data (using unnest_tokens())
extracting summaries from text data
\(n\)-grams
We shall start this tutorial by looking at the "tidy" data format - a
powerful way to make handling data easier and more effective. We shall
also build a simple text generator. We then move on to tokens and the
application of the unnest_tokens() function, and finally to uni-grams,
bi-grams and \(n\)-grams in general. An \(n\)-gram is a sequence of n
words in a text, and can be used in tokenizing.
Dealing with text has been of interest for some time, but it has risen to become one of the hottest topics in data science. Many individuals within the community who have had rigorous exposure to applied (or theoretical) statistics are trained to handle tabular or rectangular data that is mostly numeric, but much of the data proliferating today is unstructured and text-heavy. Many of us who work in analytical fields are not trained in even simple interpretation of natural language. Ultimately, working with text offers a plethora of unexplored insights and opportunities to derive value.
First load the required packages for this notebook.
library(tidyverse)
library(tidytext)
library(stringr)
library(lubridate)
library(ggpubr)
library(wordcloud)
options(repr.plot.width=4, repr.plot.height=3) # set plot size in the notebook
In this and the following parts of this tutorial, we’ll be using a
data set containing all of Donald Trump’s tweets. An archive of his
tweets was maintained here. These
can be downloaded as zipped JSON files. This repository only contains
tweets up to the end of 2018 and has at the time of writing not been
updated in several months. The available data have already been put into
an .RData file, but should the repository ever be updated
again, you can download and unzip the
condensed_20xx.json.zip files and use the code below.
library(jsonlite)
tweets <- data.frame()
for(i in 2009:2018){
x <- fromJSON(txt=paste0('data/trump_tweets_all/condensed_',i,'.json'), simplifyDataFrame = T) %>% map_df(rev)
tweets <- rbind.data.frame(tweets, rev(x))
}
rm(x)
save(tweets,file='data/trump-tweets.RData')
The data consists of a single data frame called
tweets. Having loaded it, we examine the contents:
load('trump-tweets.RData')
str(tweets)
## tibble [36,307 × 8] (S3: tbl_df/tbl/data.frame)
## $ is_retweet : logi [1:36307] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ favorite_count : int [1:36307] 202 3 2 27 1950 13 10 6 2 5 ...
## $ in_reply_to_user_id_str: chr [1:36307] NA NA NA NA ...
## $ retweet_count : int [1:36307] 253 2 3 8 1421 10 11 3 1 3 ...
## $ created_at : chr [1:36307] "Mon May 04 18:54:25 +0000 2009" "Tue May 05 01:00:10 +0000 2009" "Fri May 08 13:38:08 +0000 2009" "Fri May 08 20:40:15 +0000 2009" ...
## $ text : chr [1:36307] "Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!" "Donald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Th"| __truncated__ "Donald Trump reads Top Ten Financial Tips on Late Show with David Letterman: http://tinyurl.com/ooafwn - Very funny!" "New Blog Post: Celebrity Apprentice Finale and Lessons Learned Along the Way: http://tinyurl.com/qlux5e" ...
## $ id_str : chr [1:36307] "1698308935" "1701461182" "1737479987" "1741160716" ...
## $ source : chr [1:36307] "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" ...
We’ll start by turning the data frame into a tibble, and turning the
date into a format that will be easier to work with later on. We use the
parse_datetime() function from the
readr package. Note that we do not look into
processing dates and times in detail in this tutorial, but if you are
interested the relevant chapter of R4DS is here. An example
of the parsed date is shown below.
# turn into a tibble
tweets <- as_tibble(tweets)
# parse the date
tweets <- tweets %>% mutate(date = parse_datetime(str_sub(tweets$created_at,5,30), '%b %d %H:%M:%S %z %Y'))
tweets$date[1]
## [1] "2009-05-04 18:54:25 UTC"
Once the dates and times have been appropriately parsed we can perform various operations on them. For example, below we work out the earliest and latest tweets in the data set, and the duration of time covered by the data set (the difference of two datetimes is a difftime, here measured in days).
paste("Earliest tweet date:",min(tweets$date))
## [1] "Earliest tweet date: 2009-05-04 18:54:25"
paste("Latest tweet date:",max(tweets$date))
## [1] "Latest tweet date: 2018-12-31 23:53:06"
paste("Duration of time covered in this data set is", round(max(tweets$date) - min(tweets$date),2))
## [1] "Duration of time covered in this data set is 3528.21"
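As an aside, if you want the gap in a specific unit you can ask for it explicitly with difftime(). A minimal sketch, using the earliest and latest timestamps computed above:

```r
t1 <- as.POSIXct('2009-05-04 18:54:25', tz = 'UTC')
t2 <- as.POSIXct('2018-12-31 23:53:06', tz = 'UTC')
difftime(t2, t1, units = 'days')   # the duration in days (~3528.21)
difftime(t2, t1, units = 'weeks')  # the same gap expressed in weeks
```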
How many months are covered in this period?
n_months <- interval(min(tweets$date), max(tweets$date)) %/% months(1) + 1 #Add 1 to determine how MANY months were represented
n_months
## [1] 116
Let’s look at a few tweets. The slice_sample() function randomly selects \(n\) rows (here \(n=3\)).
set.seed(5073)
sample_tweets <- tweets %>% slice_sample(n = 3) %>% select(date,text)
sample_tweets
## # A tibble: 3 × 2
## date text
## <dttm> <chr>
## 1 2017-01-27 23:46:22 "I promise that our administration will ALWAYS have your …
## 2 2015-01-05 03:49:46 "\"@NicoleAMarin: You're Fired! It's music to my ears whe…
## 3 2012-09-28 18:00:42 "\"Trump buys mansion adjacent to family winery\" http://…
We start by plotting the number of tweets Trump has made over time. Retweets are shown in blue.
ggplot(tweets, aes(x = date, fill = is_retweet)) +
geom_histogram(position = 'identity', bins = n_months, show.legend = FALSE)
Most of the tweets are posted by Trump himself, and very few are
retweets. Given that Trump only started retweeting from 2016, it begs
the question: when did retweeting start? Was it even available before
2016? On November 05, 2009 Twitter started a limited roll-out of the
‘retweet’ feature to its users, so Trump simply chose not to
retweet prior to 2016. Even in the most recent data, Trump more
often than not tweets himself instead of retweeting.
Looking back at the sample tweets we had, shown below:
sample_tweets[,2]
## # A tibble: 3 × 1
## text
## <chr>
## 1 "I promise that our administration will ALWAYS have your back. We will ALWAYS…
## 2 "\"@NicoleAMarin: You're Fired! It's music to my ears when it comes from @rea…
## 3 "\"Trump buys mansion adjacent to family winery\" http://t.co/sTJXhgbK via @t…
We see from the few example tweets above that tweets have a
particular format, but that the format is quite hard to pin down. There
are special characters like @ and # that mean
something, some tweets have links while others do not, some are a single
sentence and others are several, some mention people by name, and so on.
This begs the question: how are we going to structure this data in a
meaningful way? To answer it, we’re going to go back a bit
and talk about “tidy” data.
Tidy data is basically just a way of consistently organizing your data that often makes subsequent analysis easier, particularly if you are using tidyverse packages. Getting your data into this format requires some upfront work, but that work pays off in the long term. Tidy data has three requirements:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
This only really makes sense once you work out what the variables and observations are. Of course you can’t define a variable as “what’s in a column”, or all data would be tidy!
In practice, the two main problems are that:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
In general, to fix these problems, you’ll need the two most important
functions in tidyr: pivot_longer() and pivot_wider().
For example, suppose we have the following data frame, containing movie ratings for 3 users and 4 movies:
untidy_ratings <- tribble(
~user_id, ~age, ~city, ~movie1, ~movie2, ~movie3,
1, 49, 'Cpt', 5, NA, NA,
2, 20, 'Cpt', 3, 3, 1,
3, 30, 'Jhb', NA, 5, 1
)
untidy_ratings
## # A tibble: 3 × 6
## user_id age city movie1 movie2 movie3
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 49 Cpt 5 NA NA
## 2 2 20 Cpt 3 3 1
## 3 3 30 Jhb NA 5 1
This data frame is not tidy because movie ratings are spread across multiple columns.
The “tidy” way to think about this data is that the variables are the
user_id, the user’s demographic variables
(age, city), the title of the
movie, and the rating given. At the moment some column
names (movie1, movie2, movie3)
are actually values of the title variable. This is a common
problem with untidy data. The observations in this case are user-movie
combinations. Note that this isn’t to say that “untidy” data is never
useful - in fact, data in the format above is often used when building
recommender systems. But there are advantages to working with tidy data,
especially if you are using packages from the tidyverse.
To make the data frame tidy, we use the pivot_longer()
function from the tidyr package, which provides a
number of functions for getting data into (and out of) tidy format.
Note: previously the gather() function
would have been used for this purpose, but it has been “retired”.
tidy_ratings <- untidy_ratings %>% pivot_longer(c(movie1, movie2, movie3), names_to = 'title', values_to = 'rating')
tidy_ratings %>% head(5)
## # A tibble: 5 × 5
## user_id age city title rating
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 49 Cpt movie1 5
## 2 1 49 Cpt movie2 NA
## 3 1 49 Cpt movie3 NA
## 4 2 20 Cpt movie1 3
## 5 2 20 Cpt movie2 3
This data frame shows all user-movie combinations. Those observations
with no rating (because the user has not seen the movie) are given an
NA. We say that missing values are explicitly
represented. Another way of representing missing values is to omit the
corresponding observation from the data frame. This is an
implicit representation of a missing value. To use the implicit
representation we just set values_drop_na = TRUE when using
pivot_longer().
tidy_ratings_imp <- untidy_ratings %>% pivot_longer(c(movie1, movie2, movie3), names_to = 'title', values_to = 'rating', values_drop_na = T)
tidy_ratings_imp
## # A tibble: 6 × 5
## user_id age city title rating
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 49 Cpt movie1 5
## 2 2 20 Cpt movie1 3
## 3 2 20 Cpt movie2 3
## 4 2 20 Cpt movie3 1
## 5 3 30 Jhb movie2 5
## 6 3 30 Jhb movie3 1
And we can get back to our original data frame with missing values (NA) by using the complete() function.
tidy_ratings_imp %>% complete(nesting(user_id, age, city), title) %>% head(5)
## # A tibble: 5 × 5
## user_id age city title rating
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 49 Cpt movie1 5
## 2 1 49 Cpt movie2 NA
## 3 1 49 Cpt movie3 NA
## 4 2 20 Cpt movie1 3
## 5 2 20 Cpt movie2 3
We can move in the opposite direction (spreading a variable across
multiple columns) using the pivot_wider() function.
untidy_ratings <- tidy_ratings %>% pivot_wider(names_from = 'title', values_from = 'rating')
untidy_ratings
## # A tibble: 3 × 6
## user_id age city movie1 movie2 movie3
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 49 Cpt 5 NA NA
## 2 2 20 Cpt 3 3 1
## 3 3 30 Jhb NA 5 1
Note that it’s not always the case that a “long” data format is tidy and a “wide” format is not. For example, the data frame below is tidy,
# A tibble: 6 x 4
town month avgtemp rainfall
<chr> <chr> <dbl> <dbl>
1 A Jan 24 12
2 B Jan 27 10
3 C Jan 30 16
4 A Jun 14 22
5 B Jun 20 62
6 C Jun 5 16
but it would not be tidy if we reshaped the data into a “longer” format:
# A tibble: 12 x 4
town month weather value
<chr> <chr> <chr> <dbl>
1 A Jan avgtemp 24
2 B Jan avgtemp 27
3 C Jan avgtemp 30
4 A Jun avgtemp 14
5 B Jun avgtemp 20
6 C Jun avgtemp 5
7 A Jan rainfall 12
8 B Jan rainfall 10
9 C Jan rainfall 16
10 A Jun rainfall 22
11 B Jun rainfall 62
12 C Jun rainfall 16
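For reference, this reshape can itself be done with pivot_longer(), and undone with pivot_wider(). A minimal sketch, with the weather tibble constructed inline for illustration using the values from the tidy table above:

```r
library(dplyr)
library(tidyr)
library(tibble)

weather <- tribble(
  ~town, ~month, ~avgtemp, ~rainfall,
  'A', 'Jan', 24, 12,
  'B', 'Jan', 27, 10,
  'C', 'Jan', 30, 16,
  'A', 'Jun', 14, 22,
  'B', 'Jun', 20, 62,
  'C', 'Jun',  5, 16
)

# stack avgtemp and rainfall into a single 'value' column,
# with a 'weather' column recording which variable each value came from
weather_long <- weather %>%
  pivot_longer(c(avgtemp, rainfall), names_to = 'weather', values_to = 'value')

# pivot_wider() reverses the operation
weather_wide <- weather_long %>%
  pivot_wider(names_from = 'weather', values_from = 'value')
```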
There are several more examples in Chapter 12 of R4DS, which covers tidy data and the tidyr package. It also covers the situation where one column contains two variables.
Ultimately, note that we cannot say in general that data in long format is tidy and wide format is untidy, per se, as we have seen from the above example. It often depends on your use case and the data you need.
Tidy text is defined by the authors of tidytext as “a table with one-token-per-row” (see Chapter 1 of TMR).
A token is a whatever unit of text is meaningful for your analysis: it could be a word, a word pair, a sentence, a paragraph, a chapter, whatever.
That means that the process of getting text data tidy is largely a matter of deciding what your token is, and then splitting the text up into those tokens.
Tokenization is the process of splitting text up
into the units that we are interested in analyzing. The
unnest_tokens() function performs tokenization by splitting
text up into the required tokens, and creating a new data frame with one
token per row i.e. tidy text data.
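As a minimal sketch of what unnest_tokens() does, consider a toy two-row data frame (the text here is made up for illustration):

```r
library(tibble)
library(tidytext)

toy <- tibble(doc = c(1, 2),
              text = c('Tidy text is one token per row.',
                       'Tokens can be words or sentences.'))

# one word per row; by default unnest_tokens lowercases and strips punctuation
unnest_tokens(toy, word, text, token = 'words')
```

Each output row keeps the other columns (doc here), which is what lets us later summarize tokens by tweet, by date, and so on.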
Referring back to the example we introduced: suppose we want to
analyze the individual words Trump uses in his tweets. We do this with
unnest_tokens(tweets, word, text, token = 'words'). Note
the arguments passed to unnest_tokens():
the input data frame (tweets)
the name of the output column (word)
the input column containing the text (text, i.e. the tweets are in tweets$text)
the tokenization rule (token = 'words')
So below when we implement tokenization with
unnest_tokens(), the first argument is the tibble; the second is the
output (what column we are creating); the third is the column in the
tibble we are creating it from; and the last is the tokenization rule.
unnest_tokens(sample_tweets, word, text, token = 'words') %>% head(6)
## # A tibble: 6 × 2
## date word
## <dttm> <chr>
## 1 2017-01-27 23:46:22 i
## 2 2017-01-27 23:46:22 promise
## 3 2017-01-27 23:46:22 that
## 4 2017-01-27 23:46:22 our
## 5 2017-01-27 23:46:22 administration
## 6 2017-01-27 23:46:22 will
We note that some tokens we don’t want come up as words, like “https” and parts of links. Let’s see what happens if we tokenize by sentences:
unnest_tokens(sample_tweets, sentences, text, token = 'sentences') %>% select(sentences, everything()) %>% head(5)
## # A tibble: 5 × 2
## sentences date
## <chr> <dttm>
## 1 "i promise that our administration will always have your … 2017-01-27 23:46:22
## 2 "we will always be with you!" 2017-01-27 23:46:22
## 3 "https://t.co/d0aowhoh4x" 2017-01-27 23:46:22
## 4 "\"@nicoleamarin: you're fired!" 2015-01-05 03:49:46
## 5 "it's music to my ears when it comes from @realdonaldtrum… 2015-01-05 03:49:46
A really nice feature of tidytext is that you can
tokenize by regular expressions. This gives you a lot of flexibility in
deciding what you want a token to constitute. For example, for analyzing
tweet data we can explicitly include symbols like @ and
# that mean something, and exclude symbols like
? and ! that don’t add anything (unless we
want to include them, in which case we can!)
unnest_tokens(sample_tweets, word, text, token = 'regex', pattern = "[^\\w_#@']") %>% head(6)
## # A tibble: 6 × 2
## date word
## <dttm> <chr>
## 1 2017-01-27 23:46:22 i
## 2 2017-01-27 23:46:22 promise
## 3 2017-01-27 23:46:22 that
## 4 2017-01-27 23:46:22 our
## 5 2017-01-27 23:46:22 administration
## 6 2017-01-27 23:46:22 will
We’re now in a position to transform the full set of tweets into tidy text format. We’ll tokenize with the regular expression we used in the last example above.
unnest_reg <- "[^\\w_#@']"
tidy_tweets <- tweets %>%
mutate(text, text = str_replace_all(text, "’", "'")) %>% #replace curly apostrophe with straight
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>%
select(date, word, favorite_count)
Let’s plot the most commonly used words:
tidy_tweets %>%
count(word, sort = TRUE) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() + coord_flip() + xlab('')
Not very useful! What’s happening here, unsurprisingly, is that common words like “the”, “to”, etc are coming up most often. We can tell tidytext to ignore these words (which are called “stop words”). Let’s have a look at a sample of stop words contained in the dictionary used by tidytext:
sort(sample(stop_words$word,15))
## [1] "different" "for" "hadn't" "here's" "i'll"
## [6] "in" "interested" "isn't" "oh" "ordering"
## [11] "seems" "tends" "ways" "wherever" "you"
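An alternative to filtering with !word %in% stop_words$word, and the idiom used throughout the tidytext book, is an anti_join() against the stop word lexicon. A quick sketch on a toy token data frame (the words are made up for illustration):

```r
library(dplyr)
library(tibble)
library(tidytext)

toy_tokens <- tibble(word = c('the', 'fake', 'is', 'hillary'))

# anti_join keeps only the rows whose word does NOT appear in stop_words
toy_tokens %>% anti_join(stop_words, by = 'word')
```

Here “the” and “is” are dropped as stop words, leaving “fake” and “hillary”.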
There are also some words like “http” and “www” that are not stop
words but that we don’t want. We need to remove these “manually” using
str_replace_all() with a regular expression of our choice.
The regular expression below is a bit more complex than what we’ve used
before.
We look at a few examples below. Play around with the input tweets and the regular expression to get a good idea of what everything does.
# pattern that we want to remove (replace with nothing)
replace_reg <- '(https?:.*?(\\s|.$))|(www.*?(\\s|.$))|&|<|>'
# tweet with a www. link
tweets$text[145]
str_replace_all(tweets$text[145], replace_reg, '')
# tweet with an http link at end of tweet
tweets$text[194]
str_replace_all(tweets$text[194], replace_reg, '')
# tweet with multiple &
tweets$text[36185]
str_replace_all(tweets$text[36185], replace_reg, '')
The basic idea is:
https?: finds http: or https:.
.* finds the longest match, which we don’t want (why?), so we use .*?, which finds the shortest match.
(\\s|.$) says “go until you hit a space or the end of the string”.
The same pattern is applied to links starting with www.
&amp;, &lt;, &gt; sometimes appear in the tweets; these are HTML entities for &, <, > respectively.
Now we can put it all together – first remove the retweets (we only
want to look at Trump’s own tweets), then remove the non-words,
unnest into words, and remove stop words.
# pattern that we want to remove (replace with nothing)
replace_reg <- '(https?:.*?(\\s|.$))|(www.*?(\\s|.$))|&|<|>'
unnest_reg <- "[^\\w_#@']"
tidy_tweets <- tweets %>%
mutate(text, text = str_replace_all(text, "’", "'")) %>% #replace curly apostrophe with straight
filter(is_retweet == FALSE) %>% #remove retweets
mutate(text = str_replace_all(text, replace_reg, '')) %>% #remove with a reg exp (http, www, etc)
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>% #unnest tokens
filter(!word %in% stop_words$word, str_detect(word, '[a-z]')) %>% #remove stop words and tokens without letters
select(date,word,favorite_count)
We again plot the most commonly used tokens in our newly-cleaned data frame.
tidy_tweets %>%
count(word, sort = TRUE) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() + coord_flip() + xlab('')
It turns out Trump likes tweeting about himself, mostly.
We can also see whether being president has changed the words he most
commonly uses. To do this we first create a new variable: a binary
indicator of whether a tweet was made before or after Trump became
president. We do this by comparing the date of the tweet to the date of
the US election (8th November 2016). Note that once we have the date in
a recognized format like ymd or dmy, this
comparison is trivial.
tidy_tweets <- tidy_tweets %>% mutate(is_potus = (date > ymd(20161108)))
options(repr.plot.width=6, repr.plot.height=5) # make plot size bit bigger for next plots
tidy_tweets %>%
group_by(is_potus) %>%
count(word, sort = TRUE) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word, n), n, fill = is_potus)) + geom_col() + coord_flip() + xlab('')
Interesting insights are brought out here. He only started mentioning
words like fake (news), democrats and military, for example, when he
became president. Conversely, he mentioned Obama (or at least that
name) quite often before he was president, but rarely afterwards.
The plot above is a bit unsatisfying - there are obviously a lot more tweets pre-presidency, and that makes it difficult to see what is happening in the post-presidency frequencies. Below we transform the absolute frequencies into relative ones and plot those.
total_tweets <- tidy_tweets %>%
group_by(is_potus) %>%
summarise(total = n())
tidy_tweets %>%
group_by(is_potus) %>%
count(word, sort = TRUE) %>% #count the number of times word used
left_join(total_tweets, by = 'is_potus') %>% #add the total number of tweets made (pre- or post-potus)
mutate(freq = n/total) %>% #add relative frequencies
filter(rank(desc(freq)) < 20) %>%
ggplot(aes(reorder(word, freq), freq, fill = is_potus)) +
geom_col() +
coord_flip() +
xlab('') +
facet_grid(.~is_potus)
Below we show a wordcloud of Trump’s tweets after he became president. Wordclouds are not particularly informative – they just plot words proportional to their frequency of use and position them in an attractive way. This uses the wordcloud package.
tidy_tweets %>%
filter(is_potus == TRUE) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
So far we’ve considered words as individual units, and considered their relationships to sentiments or to documents. However, many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately, or which tend to co-occur within the same documents.
Now we introduce and explore some of the methods tidytext offers for calculating and visualizing relationships between words in your text dataset. This includes the token = “ngrams” argument, which tokenizes by pairs of adjacent words rather than by individual ones. More information can be found here.
An n-gram is a sequence of n words in a text. Uni-grams are single words, bi-grams are pairs of adjacent words, tri-grams are sequences of three words, and so on.
The unnest_tokens() function allows you to easily
extract n-grams using the “n-grams” token.
# bigrams
unnest_tokens(sample_tweets, bigram, text, token = 'ngrams', n = 2) %>% head(6)
## # A tibble: 6 × 2
## date bigram
## <dttm> <chr>
## 1 2017-01-27 23:46:22 i promise
## 2 2017-01-27 23:46:22 promise that
## 3 2017-01-27 23:46:22 that our
## 4 2017-01-27 23:46:22 our administration
## 5 2017-01-27 23:46:22 administration will
## 6 2017-01-27 23:46:22 will always
Let’s try tokenizing three words at a time (tri-grams).
# trigrams
unnest_tokens(sample_tweets, trigram, text, token = 'ngrams', n = 3) %>% head(6)
## # A tibble: 6 × 2
## date trigram
## <dttm> <chr>
## 1 2017-01-27 23:46:22 i promise that
## 2 2017-01-27 23:46:22 promise that our
## 3 2017-01-27 23:46:22 that our administration
## 4 2017-01-27 23:46:22 our administration will
## 5 2017-01-27 23:46:22 administration will always
## 6 2017-01-27 23:46:22 will always have
We extract the full set of bi-grams below, and do the same cleaning
up we did for unigrams earlier. Removing stop words is a bit trickier
with bi-grams. We need to separate each bi-gram into its constituent
words, remove the stop words, and then put the bi-grams back together
again. Both separate() and unite() are
tidyr functions.
replace_reg <- '(https?:.*?(\\s|.$))|(www.*?(\\s|.$))|&|<|>'
# tokenization
tweet_bigrams <- tweets %>%
filter(is_retweet == FALSE) %>%
mutate(text = str_replace_all(text, replace_reg, '')) %>%
unnest_tokens(bigram, text, token = 'ngrams', n = 2)
# separate the bigrams
bigrams_separated <- tweet_bigrams %>%
separate(bigram, c('word1', 'word2'), sep = ' ')
# remove stop words
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word)
# join up the bigrams again
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = ' ')
We can now see what the most common bi-grams are. Note that these are very different from the most common words. That is, they definitely provide different and useful information over and above what the unigrams did.
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE) %>%
filter(rank(desc(n)) <= 10) %>%
na.omit() #if a tweet contains just one word, then the bigrams will return NA
bigram_counts %>% head(5)
## # A tibble: 5 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 donald trump 1105
## 2 fake news 339
## 3 crooked hillary 270
## 4 hillary clinton 262
## 5 white house 260
These are the bi-grams (pairs of consecutively used words) that appeared most often after removing all of the stop words and other non-word tokens.
This one-bigram-per-row format is helpful for exploratory analyses of the text. We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or “graph.” Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables:
from: the node an edge is coming from
to: the node an edge is going towards
weight: a numeric value associated with each edge
The igraph package has many powerful functions for
manipulating and analyzing networks. One way to create an igraph object
from tidy data is the graph_from_data_frame() function,
which takes a data frame of edges with columns for “from”, “to”, and
edge attributes (in this case n):
library(igraph)
# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
filter(n > 20) %>%
graph_from_data_frame()
bigram_graph
## IGRAPH 01ac3e4 DN-- 16 9 --
## + attr: name (v/c), n (e/n)
## + edges from 01ac3e4 (vertex names):
## [1] donald ->trump fake ->news crooked ->hillary
## [4] hillary ->clinton white ->house celebrity->apprentice
## [7] president->obama trump ->tower witch ->hunt
We can now use the ggraph package which has
visualization methods for graph objects. This package implements these
visualizations in terms of the grammar of graphics, which we are already
familiar with from ggplot2.
We can convert an igraph object into a ggraph with the
ggraph function, after which we add layers to it, much like
layers are added in ggplot2. For example, for a basic graph
we need to add three layers: nodes, edges, and text.
library(ggraph)
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
The above shows common bi-grams in Donald Trump’s tweets: those that occurred more than 20 times and where neither word was a stop word. We can see some details of the text structure. For example, tower and donald both connect to a common node, trump, which makes sense. Other than that there is not much to dive into.
Note that we can make some visual improvements to the figure above. We
add the edge_alpha aesthetic to the link layer to make
links transparent based on how common or rare the bi-gram is. We add
directionality with an arrow, constructed using
grid::arrow(), including an end_cap option that tells the
arrow to end before touching the node. We tinker with the options to the
node layer to make the nodes more attractive (larger, blue points).
Finally, we add a theme that’s useful for plotting networks,
theme_void().
set.seed(2020)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
It may take some experimentation with ggraph to get your networks into a presentable format like this, but the network structure is a useful and flexible way to visualize relational tidy data.
Note that this is a visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word.
More useful methods and techniques for \(n\)-grams are found in the Text Mining with R textbook, here.
One area of natural language processing is (natural language) generation, roughly speaking an attempt to generate realistic looking and sounding text and speech from some data or knowledge base. In this section we’ll build a model for generating Trump-like tweets. We do this by looking at sequences of words (or more generally n-grams) that Trump uses.
We begin by generating Trump-like sentences. We do this by: starting at the end of a sentence (a full stop), repeatedly sampling a word that can follow the current word (based on the word pairs observed in Trump’s tweets), and ending the sentence when we sample a full stop.
We still need to turn sentences into tweets - a simple way is just to generate sentences until you are out of space (140 characters).
Previously, full stops were included as separating characters, i.e. we
split a string wherever we encountered a full stop (see
unnest_reg above, and compare below). Now we want to treat
a full stop as equivalent to a word: since sentences end with a full
stop, the full stop becomes something that can follow a previous word.
We thus need to create a new tidy tweets data frame that does this.
We also do some additional cleaning. For the word “don’t” we drop the apostrophe. Also, “Donald J. Trump” comes up often, so we remove that full stop since we don’t want it to interfere with treating full stops as words.
unnest_reg <- "[^\\w_#@'\\.]" #note adding a fullstop to the reg exp used before
tidy_tweets_wstop <- tweets %>%
filter(is_retweet == FALSE) %>%
mutate(text = str_replace_all(text, "[Dd]on't", 'dont')) %>% #some additional cleaning
mutate(text = str_replace_all(text, '(j\\.)|(J\\.)', 'j')) %>% #some additional cleaning
mutate(text = str_replace_all(text, replace_reg, '')) %>%
mutate(text = str_replace_all(text, '\\.', ' \\.')) %>% #add a space before fullstop so counted as own word
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>%
select(id_str, date, word) %>%
group_by(id_str) %>% #group words by tweet
mutate(next_word = lead(word)) #the "lead" and "lag" operators can be very useful!
Let’s have a look at the data frame we just constructed.
head(tidy_tweets_wstop, 6)
## # A tibble: 6 × 4
## # Groups: id_str [1]
## id_str date word next_word
## <chr> <dttm> <chr> <chr>
## 1 1698308935 2009-05-04 18:54:25 be sure
## 2 1698308935 2009-05-04 18:54:25 sure to
## 3 1698308935 2009-05-04 18:54:25 to tune
## 4 1698308935 2009-05-04 18:54:25 tune in
## 5 1698308935 2009-05-04 18:54:25 in and
## 6 1698308935 2009-05-04 18:54:25 and watch
We now count the number of times each word pair (word followed by next_word) occurs. This is the same as the bi-gram frequency count we did above, except that we have now included full stops.
transitions <- tidy_tweets_wstop %>%
group_by(word,next_word) %>%
count() %>%
arrange(desc(n)) %>%
ungroup() # remember to ungroup else later steps are slow!
transitions
## # A tibble: 234,662 × 3
## word next_word n
## <chr> <chr> <int>
## 1 . <NA> 10344
## 2 . . 4701
## 3 thank you 2239
## 4 of the 2214
## 5 will be 1880
## 6 in the 1624
## 7 is a 1260
## 8 a great 1163
## 9 donald trump 1110
## 10 . i 1063
## # … with 234,652 more rows
We see a lot of full stops followed by NA. And a lot of full stops followed by full stops.
The last full stop of every tweet is followed by an NA.
This causes problems later on, so we replace the NA with
another full stop.
transitions$next_word[is.na(transitions$next_word)] <- '.'
Finally, we unleash our uni-gram based Trump tweeter. This model samples the next word uniformly at random from the list of observed next words. It keeps adding words until the tweet exceeds 140 characters and a full stop is reached, so the generated tweets will run slightly over the 140-character limit.
# trump v1, unigram model, random sampling
set.seed(5073)
# start at the end of a sentence (so next word is a start word)
current <- '.'
result <- '.'
keep_going <- TRUE
while(keep_going == TRUE){
# get next word
next_word <- transitions %>%
filter(word == as.character(current)) %>%
slice_sample(n = 1) %>% # random sampling
select(next_word)
# combine with result so far
result <- str_c(result,' ', next_word)
current <- next_word
# does the current word appear in the 'word' column?
n_current <- sum(transitions$word == as.character(current))
# keep going if can look up current word and tweet is < 140 or current word is not .
keep_going <- ifelse(n_current == 0, FALSE,
ifelse(nchar(result) < 140, TRUE,
ifelse(str_detect(current,'\\.'), FALSE, TRUE)))
}
# show text generation
result
## [1] ". best cast or made #betterwithfriends . hosting let everyone i drove into any clue someone should thank steve . where will washington . being incorrect ill say isis would fire that garbage alcohol in millions restoring fiscal issues that we ran him ok but obstructionists and im pretty weak america join together our multi faceted transactions i walked past cycle ."
The generated tweets have traces of Trump but are more or less gibberish. We can try to improve the generation model by sampling each next word in proportion to its bi-gram frequency, rather than uniformly at random.
# trump v2, unigram model, sample using transition probabilities
set.seed(5073)
# start at the end of a sentence (so next word is a start word)
current <- '.'
result <- '.'
keep_going <- TRUE
while(keep_going == TRUE){
# get next word
next_word <- transitions %>%
filter(word == as.character(current)) %>%
slice_sample(n = 1, weight_by = n) %>% # proportional to count
select(next_word)
# combine with result so far
result <- str_c(result,' ', next_word)
current <- next_word
# does the current word appear in the 'word' column?
n_current <- sum(transitions$word == as.character(current))
# keep going if can look up current word and tweet is < 140 or current word is not .
keep_going <- ifelse(n_current == 0, FALSE,
ifelse(nchar(result) < 140, TRUE,
ifelse(str_detect(current,'\\.'), FALSE, TRUE)))
}
result
## [1] ". make a fundraiser tonight at trump make you forgot to issue on food stamp policies have not want to usa pageant in congress travelling on april s ability of a terrorist pan handle the public about gadhafi ."
It’s hard to tell, but from a few runs it looks a little better, though still not great.
Let’s see if we do any better if we look at transitions between bigrams rather than between words. As for the unigram model, we first need to add full stops, and then count how many times each transition (now between pairs of bigrams) occurs.
# we've already lagged unigrams with full stops before, use these to create lagged bigrams
bigrams_wstop <- tidy_tweets_wstop %>%
filter(!is.na(word) & !is.na(next_word)) %>% #don't want to unite with NAs
unite(bigram, word, next_word, sep = ' ') %>%
mutate(next_bigram = lead(bigram, 2))
bigrams_wstop
## # A tibble: 638,248 × 4
## # Groups: id_str [35,182]
## id_str date bigram next_bigram
## <chr> <dttm> <chr> <chr>
## 1 1698308935 2009-05-04 18:54:25 be sure to tune
## 2 1698308935 2009-05-04 18:54:25 sure to tune in
## 3 1698308935 2009-05-04 18:54:25 to tune in and
## 4 1698308935 2009-05-04 18:54:25 tune in and watch
## 5 1698308935 2009-05-04 18:54:25 in and watch donald
## 6 1698308935 2009-05-04 18:54:25 and watch donald trump
## 7 1698308935 2009-05-04 18:54:25 watch donald trump on
## 8 1698308935 2009-05-04 18:54:25 donald trump on late
## 9 1698308935 2009-05-04 18:54:25 trump on late night
## 10 1698308935 2009-05-04 18:54:25 on late night with
## # … with 638,238 more rows
We now calculate the frequency count for each bigram-to-next_bigram transition. We first remove any rows where either bigram or next_bigram is missing.
# transition matrix
bigram_transitions <- bigrams_wstop %>%
filter(!is.na(bigram) & !is.na(next_bigram)) %>%
group_by(bigram,next_bigram) %>%
count() %>%
arrange(desc(n)) %>%
ungroup() # remember to ungroup else later steps are slow!
bigram_transitions
## # A tibble: 497,313 × 3
## bigram next_bigram n
## <chr> <chr> <int>
## 1 . . . . 873
## 2 make america great again 456
## 3 the u .s . 440
## 4 art of the deal 151
## 5 the art of the 137
## 6 . think like a 109
## 7 . thank you . 103
## 8 please run for president 101
## 9 i will be interviewed 99
## 10 think like a champion 89
## # … with 497,303 more rows
In the bigram model, a full stop is one “word” in a bigram. So we can’t start with only a full stop, like we did before. The approach we’ll take is to randomly select one of the bigrams Trump has used.
set.seed(5073)
# extract all starting rows
start_bigrams <- bigrams_wstop %>%
group_by(id_str) %>%
slice_head(n = 1) %>% # takes the first row from each tweet
ungroup()
# choose one starting bigram
current <- start_bigrams %>%
slice_sample(n = 1) %>%
select(bigram) %>%
as.character()
current
## [1] "great job"
Finally we can put everything together to generate a tweet using the bigram model we just created.
# trump v3, bigram model, sample using transition probabilities
# start with starting bigram previously generated
result <- current
keep_going <- TRUE
while(keep_going == TRUE){
# get next bigram
next_bigram <- bigram_transitions %>%
filter(bigram == as.character(current)) %>%
slice_sample(n = 1, weight_by = n) %>%
select(next_bigram)
# combine with result so far
result <- str_c(result,' ', next_bigram)
current <- next_bigram
# does the current bigram appear in the bigram column?
n_current <- sum(bigram_transitions$bigram == as.character(current))
# keep going if can look up current bigram and tweet is < 140 or current bigram
# does not contain a .
keep_going <- ifelse(n_current == 0, FALSE,
ifelse(nchar(result) < 140, TRUE,
ifelse(str_detect(current,'\\.'), FALSE, TRUE)))
}
result
## [1] "great job on completing our new secretary of defense james mattis and general chief of staff john kelly who is tough on crime border military vets and the @senategop we are appointing high quality federal district . ."
Sentiment analysis is the study of the emotional content of a body of text. In this section we provide an introduction to sentiment analysis, covering sentiment lexicons, word-level and tweet-level sentiment, and the handling of negation.
We end the section by using all of these ideas to analyze the emotional content of Donald Trump’s tweets and examine how it has changed over time.
Note that Chapter 2 of Text Mining in R covers sentiment analysis, and negation is handled in Chapter 4. Many of the ideas and some of the code in this workbook are drawn from these chapters.
A common and intuitive approach to text is sentiment analysis. In a grand sense, we are interested in the emotional content of some text, e.g. posts on Facebook, tweets, or movie reviews. Most of the time, this is obvious when one reads it, but if you have hundreds of thousands or millions of strings to analyze, you’d like to be able to do so efficiently.
When we read text, as humans, we infer the emotional content from the words used in the text, and some more subtle cues involving how these words are put together. Sentiment analysis tries to do the same thing algorithmically.
One way of approaching the problem is to assess the sentiment of individual words, and then aggregate the sentiments of the words in a body of text in some way. For example, if we can classify whether each word is positive, negative, or neutral, we can count up the number of positive, negative, and neutral words in the document and define that as the sentiment of the document. This is just one way - a particularly simple way - of doing document-level sentiment analysis.
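As a minimal sketch of this counting idea, using a made-up three-word lexicon and base R only:

```r
# Hypothetical mini-lexicon mapping words to sentiments (for illustration only)
lexicon <- c(great = "positive", amazing = "positive", sad = "negative")
doc <- c("great", "amazing", "sad", "wall")     # "wall" is not in the lexicon
sent <- lexicon[doc]                            # look up each word (NA if absent)
net <- sum(sent == "positive", na.rm = TRUE) - sum(sent == "negative", na.rm = TRUE)
net  # 2 positives minus 1 negative = 1: a mildly positive "document"
```

The real lexicons used below work the same way, just at a much larger scale and joined on via dplyr rather than vector indexing.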
When assessing the sentiment or emotional content of individual words, we usually make use of existing sentiment dictionaries (or “lexicons”) that have already done this using some kind of manual classification.
Note: a token simply represents a unit of text. Put tokens together and you have a document; a collection of documents is a corpus.
First load the packages we need for this section:
library(tidyverse)
library(tidytext)
library(textdata)
library(stringr)
library(lubridate)
options(repr.plot.width=4, repr.plot.height=3) # set plot size in the notebook
We shall be using the Trump tweet data we used in the earlier part of this tutorial. Remember, this data contains the tweets from Trump. We need to get the data into tidy text format. These are the same operations we did in the previous section.
load('trump-tweets.RData')
# make data a tibble
tweets <- as_tibble(tweets)
# parse the date and add some date related variables
tweets <- tweets %>%
mutate(date = parse_datetime(str_sub(tweets$created_at, 5, 30), '%b %d %H:%M:%S %z %Y')) %>%
mutate(is_potus = (date > ymd(20161108))) %>%
mutate(month = make_date(year(date), month(date)))
# turn into tidy text
replace_reg <- '(http.*?(\\s|.$))|(www.*?(\\s|.$))|&amp;|&lt;|&gt;'
unnest_reg <- "[^\\w_#@']"
tidy_tweets <- tweets %>%
filter(is_retweet == FALSE) %>% #remove retweets
mutate(text = str_replace_all(text, replace_reg, '')) %>% #remove stuff we don't want like links
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>% #tokenize
filter(!word %in% stop_words$word, str_detect(word, '[A-Za-z]')) %>% #remove stop words
select(date, word, is_potus, favorite_count, id_str, month) #choose the variables we need
Our data for this part of the tutorial is now ready. Let’s have a look at it again before discussing lexicons.
tidy_tweets %>% head(5)
## # A tibble: 5 × 6
## date word is_potus favorite_count id_str month
## <dttm> <chr> <lgl> <int> <chr> <date>
## 1 2009-05-04 18:54:25 tune FALSE 202 1698308935 2009-05-01
## 2 2009-05-04 18:54:25 watch FALSE 202 1698308935 2009-05-01
## 3 2009-05-04 18:54:25 donald FALSE 202 1698308935 2009-05-01
## 4 2009-05-04 18:54:25 trump FALSE 202 1698308935 2009-05-01
## 5 2009-05-04 18:54:25 late FALSE 202 1698308935 2009-05-01
The gist is that we are dealing with a specific, pre-defined vocabulary, and any analysis will only be as good as the lexicon. The goal is usually to assign a sentiment score to a text, possibly an overall numeric score or a broadly positive or negative label. Other analyses may then build on these scores, for example models that predict sentiment.
The tidytext package (together with textdata) provides four existing sentiment lexicons or dictionaries. These describe the emotional content of individual words in different formats, and have been put together manually. We will only be considering three of these; the fourth, loughran, is intended for use with financial documents.
afinn <- get_sentiments('afinn')
bing <- get_sentiments('bing')
save(afinn, bing, file = "dsfi-lexicons.Rdata")
We can now have a look at each of the lexicons:
load("dsfi-lexicons.Rdata")
afinn %>% slice_sample(n = 5) %>% head(6)
## # A tibble: 5 × 2
## word value
## <chr> <dbl>
## 1 motivate 1
## 2 darkness -1
## 3 amazing 4
## 4 cruel -3
## 5 attracted 1
bing %>% slice_sample(n = 5) %>% head(6)
## # A tibble: 5 × 2
## word sentiment
## <chr> <chr>
## 1 godlike positive
## 2 aggrivation negative
## 3 incomprehensible negative
## 4 illuminating positive
## 5 saintly positive
Below we use the bing lexicon to add a new variable
indicating whether each word in our tidy_tweets data frame
is positive or negative. We use a left join here, which keeps
all the words in tidy_tweets.
Words appearing in our tweets but not in the bing lexicon
will appear as NA. We rename these “neutral”, but need to
be a bit careful here. No sentiment lexicon contains all words, so some
words that are actually positive or negative will be labelled
as NA and hence “neutral”. We can avoid this problem by
using an inner join rather than a left join, by filtering out neutral
words later on, or by just keeping in mind that “neutral” doesn’t really
mean “neutral”.
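The difference between the two joins can be seen on a toy example; the two-word mini-lexicon below is made up purely for illustration:

```r
library(dplyr)
words <- tibble(word = c("great", "disaster", "wall"))
mini_bing <- tibble(word      = c("great", "disaster"),   # hypothetical two-word lexicon
                    sentiment = c("positive", "negative"))
left_join(words, mini_bing, by = "word")   # 3 rows: "wall" kept, sentiment is NA
inner_join(words, mini_bing, by = "word")  # 2 rows: "wall" dropped entirely
```

With the left join, every tweet word survives and unmatched words get NA; with the inner join, unmatched words silently disappear, which changes any denominator you later compute.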
There’s one last issue: in the bing lexicon the word “trump” is positive, which will obviously skew the sentiment of Trump’s tweets, particularly bearing in mind he often tweets about himself! We manually recode the sentiment of this word to “neutral”.
tidy_tweets <- tidy_tweets %>%
left_join(bing) %>% #add sentiments (pos or neg)
select(word, sentiment, everything()) %>%
mutate(sentiment = ifelse(word == 'trump', NA, sentiment)) %>% #'trump' is a positive word in the bing lexicon!
mutate(sentiment = ifelse(is.na(sentiment), 'neutral', sentiment))
Let’s look at Trump’s 20 most common positive words:
tidy_tweets %>%
filter(sentiment == 'positive') %>%
count(word) %>%
arrange(desc(n)) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word,n),n)) + geom_col() + coord_flip() + xlab('')
And the 20 most common negative words:
tidy_tweets %>%
filter(sentiment == 'negative') %>%
count(word) %>%
arrange(desc(n)) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word,n),n)) + geom_col() + coord_flip() + xlab('')
Once we have attached sentiments to words in our data frame, we can analyze these in various ways. For example, we can examine trends in sentiment over time. Here we count the number of positive, negative and neutral words used each month and plot these. Because the neutral words dominate, it’s difficult to see any trends with them included. We therefore remove the neutral words before plotting.
sentiments_per_month <- tidy_tweets %>%
group_by(month, sentiment) %>%
summarize(n = n())
ggplot(filter(sentiments_per_month, sentiment != 'neutral'), aes(x = month, y = n, fill = sentiment)) +
geom_col()
It seems to be relatively balanced throughout, although the variation in the number of tweets made each month makes it difficult to say with any certainty which sentiments dominate over time. We can improve the visualization by plotting the proportion of all words tweeted in a month that were positive or negative. The plot shows the raw proportions as well as smoothed versions of these.
sentiments_per_month <- sentiments_per_month %>%
left_join(sentiments_per_month %>%
group_by(month) %>%
summarise(total = sum(n))) %>%
mutate(freq = n/total)
sentiments_per_month %>% filter(sentiment != 'neutral') %>%
ggplot(aes(x = month, y = freq, colour = sentiment)) +
geom_line() +
geom_smooth(aes(colour = sentiment))
We see that before 2010 his tweets had a positive sentiment overall; indeed, in proportional terms they remained predominantly positive all the way up to 2016. Once he became president this appeared to change: the gap between positive and negative sentiment narrowed over time, and in more recent years negative-sentiment words appear to be in the majority.
We can fit a simple linear model to check whether the proportion of negative words has increased over time. Strictly speaking a linear model is not appropriate, as the response is bounded between 0 and 1; you could try fitting e.g. a binomial GLM instead.
model <- lm(freq ~ month, data = subset(sentiments_per_month, sentiment == 'negative'))
summary(model)
##
## Call:
## lm(formula = freq ~ month, data = subset(sentiments_per_month,
## sentiment == "negative"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044031 -0.011884 -0.001153 0.013793 0.052094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.821e-01 3.326e-02 -5.475 3.00e-07 ***
## month 1.534e-05 2.045e-06 7.502 2.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02052 on 105 degrees of freedom
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3428
## F-statistic: 56.28 on 1 and 105 DF, p-value: 2.118e-11
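A hedged sketch of the binomial GLM suggested above: since sentiments_per_month is built earlier in the document, we fabricate a small stand-in data frame with the same columns just to show the call (the numbers are invented).

```r
# Hypothetical stand-in for the negative rows of sentiments_per_month
neg <- data.frame(month = 1:12,
                  n     = c(5, 7, 6, 9, 11, 10, 14, 13, 15, 18, 17, 20), # negative words
                  total = rep(100, 12))                                  # all words
# Model the proportion negative directly: successes vs failures
fit <- glm(cbind(n, total - n) ~ month, family = binomial, data = neg)
coef(fit)["month"] > 0  # TRUE here: the proportion negative increases over time
```

On the real data you would replace neg with subset(sentiments_per_month, sentiment == 'negative'); the binomial family respects the 0-1 bound that the linear model ignores.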
So far we’ve looked at the sentiment of individual words, but how can we assess the sentiment of longer sequences of text, like bi-grams, sentences or entire tweets? One approach is to attach sentiments to each word in the longer sequence, and then add up the sentiments over words. This isn’t the only way, but it is relatively easy to do and fits in nicely with the use of tidy text data.
Suppose we want to analyze the sentiment of entire tweets. We’ll measure the positivity of a tweet by the difference in the number of positive and negative words used in the tweet.
sentiments_per_tweet <- tidy_tweets %>%
group_by(id_str) %>%
summarize(net_sentiment = (sum(sentiment == 'positive') - sum(sentiment == 'negative')),
month = first(month))
To see if the measure makes sense, let’s have a look at the most negative tweets.
tweets %>%
left_join(sentiments_per_tweet) %>%
arrange(net_sentiment) %>%
head(5) %>%
select(text, net_sentiment)
## # A tibble: 5 × 2
## text net_s…¹
## <chr> <int>
## 1 Where’s the Collusion? They made up a phony crime called Collusion, a… -11
## 2 WITCH HUNT! There was no Russian Collusion. Oh, I see, there was no R… -9
## 3 This is an illegally brought Rigged Witch Hunt run by people who are … -9
## 4 How come every time I show anger, disgust or impatience, enemies say … -8
## 5 So, the Democrats make up a phony crime, Collusion with the Russians,… -8
## # … with abbreviated variable name ¹net_sentiment
And the most positive tweets:
tweets %>%
left_join(sentiments_per_tweet) %>%
arrange(desc(net_sentiment)) %>%
head(5) %>%
select(text, net_sentiment)
## # A tibble: 5 × 2
## text net_s…¹
## <chr> <int>
## 1 "Congratulations to Patrick Reed on his great and courageous MASTERS … 10
## 2 "Thank you, @WVGovernor Jim Justice, for that warm introduction. Toni… 7
## 3 "Today, as we celebrate Hispanic Heritage Month, we share our gratitu… 7
## 4 "It is my great honor to be with so many brilliant, courageous, patri… 7
## 5 "\"Success is not the key to happiness. Happiness is the key to succe… 6
## # … with abbreviated variable name ¹net_sentiment
We can also look at trends over time. The plot below shows the proportion of monthly tweets that were negative (i.e. where the number of negative words exceeded the number of positive ones).
sentiments_per_tweet %>%
group_by(month) %>%
summarize(prop_neg = sum(net_sentiment < 0) / n()) %>%
ggplot(aes(x = month, y = prop_neg)) +
geom_line() + geom_smooth()
Interestingly, we see very few negative-sentiment tweets around 2010, in contrast to more recent years.
One problem we haven’t considered yet is what to do with terms like “not good”, where a positive word is negated by the use of “not” before it. We need to reverse the sentiment of words that are preceded by negation words like not, never, etc.
We’ll do this in the context of a sentiment analysis on bi-grams. We start by creating the bi-grams, and separating the two words making up each bi-gram. This is the same code used in the previous section.
bigrams_separated <- tweets %>%
filter(is_retweet == FALSE) %>%
mutate(text = str_replace_all(text, replace_reg, '')) %>%
unnest_tokens(bigram, text, token = 'ngrams', n = 2) %>%
separate(bigram, c('word1', 'word2'), sep = ' ')
Then we use the bing sentiment dictionary to look up the sentiment of each word in each bi-gram.
bigrams_separated <- bigrams_separated %>%
# add sentiment for word 1
left_join(bing, by = c(word1 = 'word')) %>%
rename(sentiment1 = sentiment) %>%
mutate(sentiment1 = ifelse(word1 == 'trump', NA, sentiment1)) %>%
mutate(sentiment1 = ifelse(is.na(sentiment1), 'neutral', sentiment1)) %>%
# add sentiment for word 2
left_join(bing, by = c(word2 = 'word')) %>%
rename(sentiment2 = sentiment) %>%
mutate(sentiment2 = ifelse(word2 == 'trump', NA, sentiment2)) %>%
mutate(sentiment2 = ifelse(is.na(sentiment2), 'neutral', sentiment2)) %>%
select(month, word1, word2, sentiment1, sentiment2, everything())
bigrams_separated %>% head(5)
## # A tibble: 5 × 14
## month word1 word2 senti…¹ senti…² is_re…³ favor…⁴ in_re…⁵ retwe…⁶ creat…⁷
## <date> <chr> <chr> <chr> <chr> <lgl> <int> <chr> <int> <chr>
## 1 2009-05-01 be sure neutral neutral FALSE 202 <NA> 253 Mon Ma…
## 2 2009-05-01 sure to neutral neutral FALSE 202 <NA> 253 Mon Ma…
## 3 2009-05-01 to tune neutral neutral FALSE 202 <NA> 253 Mon Ma…
## 4 2009-05-01 tune in neutral neutral FALSE 202 <NA> 253 Mon Ma…
## 5 2009-05-01 in and neutral neutral FALSE 202 <NA> 253 Mon Ma…
## # … with 4 more variables: id_str <chr>, source <chr>, date <dttm>,
## # is_potus <lgl>, and abbreviated variable names ¹sentiment1, ²sentiment2,
## # ³is_retweet, ⁴favorite_count, ⁵in_reply_to_user_id_str, ⁶retweet_count,
## # ⁷created_at
Now we need a list of words that we consider to be negation words. We’ll use the following set, taken from TMR Chapter 4, and show a few examples.
negation_words <- c('not', 'no', 'never', 'without')
# show a few
filter(bigrams_separated, word1 %in% negation_words) %>%
head(10) %>% select(month, word1, word2, sentiment1, sentiment2) # for display purposes
## # A tibble: 10 × 5
## month word1 word2 sentiment1 sentiment2
## <date> <chr> <chr> <chr> <chr>
## 1 2009-05-01 never be neutral neutral
## 2 2009-05-01 not be neutral neutral
## 3 2009-05-01 not a neutral neutral
## 4 2010-07-01 never be neutral neutral
## 5 2011-01-01 no environmental neutral neutral
## 6 2011-07-01 no revenue neutral neutral
## 7 2011-07-01 not negotiate neutral neutral
## 8 2011-07-01 no deal neutral neutral
## 9 2011-07-01 not answered neutral neutral
## 10 2011-07-01 not like neutral positive
We now reverse the sentiment of word2 whenever it is
preceded by a negation word, and then add up the number of positive and
negative words within a bi-gram and take the difference. That difference
(a score from -2 to +2) is the sentiment of the bi-gram.
We do this in two steps for illustrative purposes. First we reverse the sentiment of the second word in the bi-gram if the first one is a negation word.
bigrams_separated <- bigrams_separated %>%
# create a variable that is the opposite of sentiment2
mutate(opp_sentiment2 = recode(sentiment2, 'positive' = 'negative',
'negative' = 'positive',
'neutral' = 'neutral')) %>%
# reverse sentiment2 if word1 is a negation word
mutate(sentiment2 = ifelse(word1 %in% negation_words, opp_sentiment2, sentiment2)) %>%
# remove the opposite sentiment variable, which we don't need any more
select(-opp_sentiment2)
Next, we calculate the sentiment of each bi-gram and join up the words in the bi-gram again.
bigrams_separated <- bigrams_separated %>%
mutate(net_sentiment = (sentiment1 == 'positive') + (sentiment2 == 'positive') -
(sentiment1 == 'negative') - (sentiment2 == 'negative')) %>%
unite(bigram, word1, word2, sep = ' ', remove = FALSE)
bigrams_separated %>% select(word1, word2, sentiment1, sentiment2, net_sentiment)
## # A tibble: 592,367 × 5
## word1 word2 sentiment1 sentiment2 net_sentiment
## <chr> <chr> <chr> <chr> <int>
## 1 be sure neutral neutral 0
## 2 sure to neutral neutral 0
## 3 to tune neutral neutral 0
## 4 tune in neutral neutral 0
## 5 in and neutral neutral 0
## 6 and watch neutral neutral 0
## 7 watch donald neutral neutral 0
## 8 donald trump neutral neutral 0
## 9 trump on neutral neutral 0
## 10 on late neutral neutral 0
## # … with 592,357 more rows
Below we show Trump’s most common positive and negative bigrams.
bigrams_separated %>%
filter(net_sentiment > 0) %>% # get positive bigrams
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab('')
bigrams_separated %>%
filter(net_sentiment < 0) %>% # get negative bigrams
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab('')
None of the most common negative bi-grams have negated words in them but some that are slightly less frequently used do. Notice that the joint most frequently used bi-gram below is “no wonder” - which is not really negative, although you can see how, using the approach we have taken, it has ended up classified as such. Cases like these would need to be handled on an individual basis.
bigrams_separated %>%
filter(net_sentiment < 0) %>% # get negative bigrams
filter(word1 %in% negation_words) %>% # get bigrams where first word is negation
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab('')
We now look at sentiment in Shakespeare’s Romeo and Juliet. Let’s begin by loading the data, which was originally obtained online via the gutenbergr package.
library(gutenbergr)
load("gutenberg_shakespeare.RData")
rnj <- works$`Romeo and Juliet`
We’ve got the text now, but there is still work to be done. We first slice off the initial parts we don’t want (title, author, etc.), then get rid of other tidbits that would interfere, using a little regex to aid the process.
rnj_filtered = rnj %>%
slice(-(1:49)) %>%
filter(!text==str_to_upper(text), # will remove THE PROLOGUE etc.
!text==str_to_title(text), # will remove names/single word lines
!str_detect(text, pattern='^(Scene|SCENE)|^(Act|ACT)|^\\[')) %>%
select(-gutenberg_id) %>%
unnest_tokens(sentence, input=text, token='sentences') %>%
mutate(sentenceID = 1:n())
rnj_filtered
## # A tibble: 3,318 × 2
## sentence sentenceID
## <chr> <int>
## 1 two households, both alike in dignity, 1
## 2 in fair verona, where we lay our scene, 2
## 3 from ancient grudge break to new mutiny, 3
## 4 where civil blood makes civil hands unclean. 4
## 5 from forth the fatal loins of these two foes 5
## 6 a pair of star-cross'd lovers take their life; 6
## 7 whose misadventur'd piteous overthrows 7
## 8 doth with their death bury their parents' strife. 8
## 9 the fearful passage of their death-mark'd love, 9
## 10 and the continuance of their parents' rage, 10
## # … with 3,308 more rows
The following unnests the data into word tokens. At this point we can also remove stop words like a, an, the, etc.; tidytext comes with a stop_words data frame for this purpose. Note first, though, that some stop words also appear in the sentiment lexicons, so removing them can affect the results; we show some of the matches below.
# show some of the matches
stop_words$word[which(stop_words$word %in% sentiments$word)] %>% head(20)
## [1] "appreciate" "appropriate" "available" "awfully"
## [5] "best" "better" "clearly" "enough"
## [9] "like" "liked" "reasonably" "right"
## [13] "sensible" "sorry" "thank" "unfortunately"
## [17] "unlikely" "useful" "welcome" "well"
# remember to call output 'word' or antijoin won't work without a 'by' argument
rnj_filtered = rnj_filtered %>%
unnest_tokens(output=word, input=sentence, token='words') %>%
anti_join(stop_words)
Now we add the sentiments via the inner_join function. Here we use the default sentiments data frame, which is the ‘bing’ lexicon; you could use another lexicon and might get a different result.
rnj_filtered %>%
count(word) %>%
arrange(desc(n))
## # A tibble: 3,288 × 2
## word n
## <chr> <int>
## 1 thou 276
## 2 thy 165
## 3 love 140
## 4 thee 139
## 5 romeo 110
## 6 night 83
## 7 death 71
## 8 hath 64
## 9 sir 58
## 10 art 55
## # … with 3,278 more rows
rnj_sentiment = rnj_filtered %>%
inner_join(sentiments)
rnj_sentiment
## # A tibble: 2,077 × 3
## sentenceID word sentiment
## <int> <chr> <chr>
## 1 1 dignity positive
## 2 2 fair positive
## 3 3 grudge negative
## 4 3 break negative
## 5 4 unclean negative
## 6 5 fatal negative
## 7 7 overthrows negative
## 8 8 death negative
## 9 8 strife negative
## 10 9 fearful negative
## # … with 2,067 more rows
rnj_sentiment_bing = rnj_sentiment
table(rnj_sentiment_bing$sentiment)
##
## negative positive
## 1244 833
Looks like this one is going to be a downer. The following visualizes the cumulative positive and negative sentiment scores as one progresses sentence by sentence through the work, using the plotly package. The same information is also shown expressed as a difference (the opaque line).
library(plotly)
ay <- list(
tickfont = list(color = "#2ca02c40"),
overlaying = "y",
side = "right",
# title = "sentiment difference",
titlefont = list(textangle=45),
zeroline = F
)
rnj_sentiment_bing %>%
arrange(sentenceID) %>%
mutate(positivity = cumsum(sentiment=='positive'),
negativity = cumsum(sentiment=='negative')) %>%
plot_ly() %>%
add_lines(x=~sentenceID, y=~positivity, name='positive') %>%
add_lines(x=~sentenceID, y=~negativity, name='negative') %>%
add_lines(x=~sentenceID, y=~positivity-negativity, name='difference',
yaxis = "y2",
opacity=.25) %>%
layout(
xaxis = list(dtick = 200),
yaxis = list(title='absolute cumulative sentiment'),
yaxis2 = ay
)
It’s a close game until perhaps the midway point, when negativity takes over and despair sets in with the story.
In general, sentiment analysis can be a useful way of exploring data, but it is highly dependent on the context and tools used. Note also that ‘sentiment’ can be anything; it doesn’t have to be positive vs. negative. Any vocabulary may be applied, and so the approach has more utility than the usual implementation.
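For instance, a made-up “theme” vocabulary can be used exactly like a sentiment lexicon; this base-R sketch uses hypothetical words and themes:

```r
# Toy "theme" lexicon: any labelled vocabulary works, not just positive/negative
theme_lexicon <- c(love = "romance", death = "tragedy",
                   night = "tragedy", fair = "romance")
tokens <- c("love", "night", "death", "sword")  # "sword" is not in the lexicon
themes <- theme_lexicon[tokens[tokens %in% names(theme_lexicon)]]
table(themes)  # counts tokens per theme: romance 1, tragedy 2
```

The tidy-data equivalent is an inner_join against a two-column word/theme tibble, exactly as done with bing above.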
It should also be noted that the above demonstration is largely conceptual and descriptive. While fun, it’s a bit simplified. For starters, classifying words as simply positive or negative is itself not a straightforward endeavor. As we noted at the beginning, context matters, and in general you’d want to take it into account. Modern methods of sentiment analysis would use approaches like word2vec or deep learning to predict a sentiment probability, as opposed to a simple word match. Even in the above, matching sentiments to texts would probably only be a precursor to building a model predicting sentiment, which could then be applied to new data.
Finally, it should be noted that there is a great, detailed summary of the techniques applied and introduced here in the Text Mining in R book, particularly at the link shown.
In this final example we apply word frequencies, comparisons between texts, sentiment analysis and word clouds. The data set used is a series of reviews for the OnePlus phone models, found here.
The data can be downloaded directly from the URL used below.
# download data
url <- "https://raw.github.com/VladAluas/Text_Analysis/master/Datasets/Text_review.csv"
# read data
reviews <- read_csv(url)
reviews
## # A tibble: 433 × 3
## Model Segment Text
## <chr> <chr> <chr>
## 1 OnePlus 1 Introduction "The days of the $600 smartphon…
## 2 OnePlus 1 Design, Features, and Call Quality "The OnePlus One doesn't feel l…
## 3 OnePlus 1 Design, Features, and Call Quality "Our white test unit features a…
## 4 OnePlus 1 Design, Features, and Call Quality "The 5.5-inch, 1080p IPS displa…
## 5 OnePlus 1 Design, Features, and Call Quality "There are two speaker grilles …
## 6 OnePlus 1 Design, Features, and Call Quality "With GSM (850/900/1800/1900MHz…
## 7 OnePlus 1 Design, Features, and Call Quality "Call quality, unfortunately, w…
## 8 OnePlus 1 Design, Features, and Call Quality "I also noticed a bug when it c…
## 9 OnePlus 1 Design, Features, and Call Quality "Also onboard are dual-band 802…
## 10 OnePlus 1 Performance and CyanogenMod "The OnePlus One is powered by …
## # … with 423 more rows
Let’s have a look at the data.
# view
head(reviews)
## # A tibble: 6 × 3
## Model Segment Text
## <chr> <chr> <chr>
## 1 OnePlus 1 Introduction "The days of the $600 smartphone…
## 2 OnePlus 1 Design, Features, and Call Quality "The OnePlus One doesn't feel li…
## 3 OnePlus 1 Design, Features, and Call Quality "Our white test unit features a …
## 4 OnePlus 1 Design, Features, and Call Quality "The 5.5-inch, 1080p IPS display…
## 5 OnePlus 1 Design, Features, and Call Quality "There are two speaker grilles f…
## 6 OnePlus 1 Design, Features, and Call Quality "With GSM (850/900/1800/1900MHz)…
The data is structured in three columns: the model number, the segment of the review and the text from the segment.
We have chosen to keep each paragraph from each review as a separate text because it’s easier to work with, and it’s more realistic. This is most likely how you might analyse the data when you read and compare the reviews section by section.
As things stand we know that we can’t count words or quantify them in any way, so we will need to transform the last column into a more analysis-friendly format. Let’s look at tokenizing our data.
# activate some libraries
library(tidytext)
library(tidyverse)
reviews %>%
# We need to specify the name of the column to be created (Word) and the source column (Text)
unnest_tokens("Word", "Text")
## # A tibble: 30,067 × 3
## Model Segment Word
## <chr> <chr> <chr>
## 1 OnePlus 1 Introduction the
## 2 OnePlus 1 Introduction days
## 3 OnePlus 1 Introduction of
## 4 OnePlus 1 Introduction the
## 5 OnePlus 1 Introduction 600
## 6 OnePlus 1 Introduction smartphone
## 7 OnePlus 1 Introduction aren't
## 8 OnePlus 1 Introduction over
## 9 OnePlus 1 Introduction quite
## 10 OnePlus 1 Introduction yet
## # … with 30,057 more rows
The function took all the sentences from the Text column and broke them down into a format that has one word per row, and far more rows than before. Our new data structure is now one step away from a tidy format: all we need to do is count how many times each word appears in the text, and then we will have a tidy format.
As mentioned in the first part of the tutorial, the function has transformed all the words to lower case and removed all the special symbols (e.g. the $ from the price described in the introduction of the OnePlus 1). This is important because it can save us a lot of headaches when cleaning the data.
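To make that cleaning step concrete, here is a rough base-R approximation of what unnest_tokens() does to a single string: lowercasing, stripping symbols such as $, and splitting on whitespace. This is only an illustration; the real tokenizer (via the tokenizers package) handles many more cases.

```r
# Rough base-R sketch of unnest_tokens() applied to one string:
# lowercase, strip symbols (keeping apostrophes), split on whitespace.
text <- "The days of the $600 smartphone aren't over quite yet"
tokens <- tolower(text)
tokens <- gsub("[^a-z0-9' ]", "", tokens)   # drops the $, keeps "600"
tokens <- strsplit(tokens, "\\s+")[[1]]
tokens
# "the" "days" "of" "the" "600" "smartphone" "aren't" "over" "quite" "yet"
```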
Now we will transform the data into a proper tidy format. To do so,
we will unnest the sentences, count each word, and then display the
frequencies on a graph. Because we want to reuse the graph later, we
will create a function, word_frequency(), that contains all
the steps we want to apply. We will also replace some characters so
that we do not double- or under-count some words. In the function we
also turn the word column into a factor, with the levels ordered so
the most frequent words appear on top, purely for aesthetic purposes.
# tokenize
reviews_tidy <- reviews %>%
unnest_tokens("Word", "Text") %>%
mutate(Word = str_replace(Word, "'s", "")) # prevent the analysis in showing 6t and 6t's as two separate words
# create a function that will store all the operations we will repeat several times
word_frequency <- function(x, top = 10){
x %>%
count(Word, sort = TRUE) %>% #need a word count
mutate(Word = factor(Word, levels = rev(unique(Word)))) %>%
top_n(top) %>%
ungroup() %>% # useful later if we want to use a grouping variable and will do nothing if we don't
# The graph itself
ggplot(mapping = aes(x = Word, y = n)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(x = NULL)
}
Let us try out this function now.
reviews_tidy %>%
word_frequency(15)
There, frequency analysis done. Now, what does the word the say about
the OnePlus brand of phones? Nothing. Determiners and conjunctions
(e.g. the, and, a, to) are the most frequently used words in any
language, but they tell us little about the message of a sentence, at
least not by themselves. These are called stop words, and we will
eliminate them so we can focus on the words that give us a better
picture of the text.
Once again we shall use the data set called stop_words, which
contains a list of common determiners, conjunctions, adverbs and
adjectives that we can eliminate from a text so that we can analyse it
properly. Now we can recreate the previous graph after eliminating the
stop words, and see what it tells us about the OnePlus phones
overall.
# Same dataset as before with an extra code line
reviews_tidy <- reviews %>%
unnest_tokens("Word", "Text") %>%
anti_join(stop_words, by = c("Word" = "word")) %>% # anti_join keeps only the rows of reviews with no match in stop_words
mutate(Word = str_replace(Word, "'s", ""))
reviews_tidy %>%
word_frequency(15)
#> Selecting by n
As an overall idea, we can see that the brand name (OnePlus) is the most used, as we would expect. Then, we can see phone, which is to be expected since we are talking about a product that is a phone.
We can also see that galaxy is mentioned quite a lot, just as much as camera which is again expected. OnePlus promoted themselves as a brand with high performance models at a cheaper price than a flagship from Samsung or other makers, therefore it would be only natural to see the comparison between the two.
Another pairing we see is low and light, which is the part of the reviews comparing camera performance in low light. You might also have spotted that 7 and 8 are there as well. This is most likely because the 7 from the OnePlus 7 series is mentioned quite a lot; the same goes for the 8.
Now, the graph we have been looking at shows us the most frequently used words across all texts in the corpus. This is useful because it gives us some good insight into which words are most associated with OnePlus as a brand overall.
But, I would also like to have the top 5 words associated with each model. We can do so by adding two lines of code to the previous chunk. It’s as simple as below.
reviews_tidy %>%
group_by(Model) %>%
word_frequency(5) +
facet_wrap(~ Model, scales = "free_y") # This is just to split the graph into multiple graphs for each model
We have a matrix of graphics that shows us which terms are most frequently associated with a model and that is very useful from a business perspective.
Now we find ourselves in a good position. We have these graphs and need to analyse them critically - and we can automate this, to determine the conclusion for each model.
We will use a method called term frequency - inverse document
frequency. Note that we shall cover this in much greater detail in
another tutorial (Bag of Words and Topic Modelling Tutorial). For now,
we shall cover it briefly. We saw in the first graph that the most
frequent terms in the reviews are the ones with no analytical value
whatsoever: the, and, a, etc. Words that do have analytical value
(e.g. performance) appear less often. The tf_idf method
works on exactly this principle.
We can check for the words that are frequent in one review and not
the others, to see what distinguishes one document from another. This
comparison can be done with the function bind_tf_idf(),
which assigns weights to words using the principles below:
words with high frequency in all the documents: low weight
words with high frequency in just one of the documents and not the other: high weight
words with low frequency across the board: low weight
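To see where those weights come from, here is a hand computation on a tiny made-up two-document corpus, using the same definitions bind_tf_idf() uses: tf is a word's count divided by the document's length, and idf is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Toy two-document corpus (hypothetical, for illustration only)
docs <- list(
  d1 = c("phone", "camera", "phone", "overheating"),
  d2 = c("phone", "camera", "elegant", "phone")
)
n_docs <- length(docs)

# idf = ln(#documents / #documents containing the word)
idf <- function(word) {
  log(n_docs / sum(vapply(docs, function(d) word %in% d, logical(1))))
}
# tf = count of the word / number of words in the document
tf <- function(word, d) sum(docs[[d]] == word) / length(docs[[d]])

tf("phone", "d1") * idf("phone")              # 0: appears in every document
tf("overheating", "d1") * idf("overheating")  # ~0.173: unique to d1, high weight
```

A word present in every document gets idf = ln(1) = 0, so its tf-idf is zero no matter how often it appears; a word concentrated in one document gets a positive weight.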
Let’s see this in practice.
review_tf_idf <-
reviews_tidy %>%
count(Model, Word, sort = TRUE) %>%
bind_tf_idf(Word, Model, n)
review_tf_idf %>%
arrange(desc(tf_idf))
## # A tibble: 8,213 × 6
## Model Word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 OnePlus 6T 6t 31 0.0293 2.08 0.0609
## 2 OnePlus 5T 5t 21 0.0195 2.77 0.0540
## 3 OnePlus 6 s9 18 0.0183 2.77 0.0508
## 4 OnePlus 7T McLaren mclaren 20 0.0292 1.67 0.0489
## 5 OnePlus 2 s6 11 0.0164 2.77 0.0456
## 6 OnePlus 7 Pro 5G 5g 40 0.0620 0.693 0.0430
## 7 OnePlus 7T 7t 22 0.0306 1.39 0.0425
## 8 OnePlus 8 Pro s20 25 0.0245 1.67 0.0410
## 9 OnePlus 8 s20 25 0.0223 1.67 0.0373
## 10 OnePlus 3T 3t 21 0.0222 1.67 0.0372
## # … with 8,203 more rows
Now we can display this using plots. Note that we need to sort the data in descending order before creating the factors for each term, as we did previously. Then we proceed to plotting.
review_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(Word = factor(Word, levels = rev(unique(Word)))) %>%
group_by(Model) %>%
top_n(5) %>%
ungroup() %>%
ggplot(mapping = aes(x = Word, y = tf_idf, fill = Model)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = NULL) +
coord_flip() +
facet_wrap(~ Model, scales = "free_y")
These are the main items that separate one review from another. Among them we can see Samsung's flagships, especially in later reviews; the two brands seem to be compared quite a lot. We can also single out that for the OnePlus 7 Pro 5G there is a problem with overheating, and that the OnePlus 6 is described as elegant.
Of course, this can be tweaked quite a bit depending on your needs. You can eliminate words, you can replace some of them, or you can add a different grouping to the analysis.
Now let’s dive into some sentiment analysis. In this context, it can be used to quickly get an idea about the product, in this case a phone. We shall use a lexicon and simply associate each word in the review with a sentiment. Then it becomes a simple matter of counting how many words are associated with positive or negative sentiments to get the overall tone of the text.
Let’s proceed by using the AFINN lexicon to check the
sentiment for each model and see how they perform. We will use just the
conclusion for each review as that should be the most relevant in
transmitting the overall sentiment for the whole review.
However, we have to keep in mind that these being technical reviews, they might contain a terminology different from the one used in natural language, and the analysis might not be as accurate.
conclusion_afinn <- reviews %>%
filter(str_detect(Segment, "Conclusion")) %>%
unnest_tokens("Word", "Text") %>%
anti_join(stop_words, by = c("Word" = "word")) %>%
# We will get the sentiments with an inner_join, since words without a match have no score value
inner_join(get_sentiments("afinn"), by = c("Word" = "word"))
conclusion_afinn
## # A tibble: 122 × 4
## Model Segment Word value
## <chr> <chr> <chr> <dbl>
## 1 OnePlus 1 Cameras and Conclusions cut -1
## 2 OnePlus 1 Cameras and Conclusions true 2
## 3 OnePlus 1 Cameras and Conclusions alive 1
## 4 OnePlus 1 Cameras and Conclusions true 2
## 5 OnePlus 1 Cameras and Conclusions miss -2
## 6 OnePlus 1 Cameras and Conclusions straight 1
## 7 OnePlus 1 Cameras and Conclusions capable 1
## 8 OnePlus 1 Cameras and Conclusions free 1
## 9 OnePlus 1 Cameras and Conclusions demand -1
## 10 OnePlus 1 Cameras and Conclusions impress 3
## # … with 112 more rows
As you can see, each token has been unnested, and assigned a sentiment value. Now, in order to check the sentiments for each review, all we need to do is add the scores and plot them.
conclusion_afinn %>%
group_by(Model) %>%
summarise(Score = sum(value)) %>%
arrange(desc(Score)) %>%
mutate(Model = factor(Model, levels = rev(unique(Model)))) %>%
ggplot(mapping = aes(x = Model, y = Score)) +
geom_col() +
coord_flip() +
labs(x = NULL)
The scores are in, and overall the OnePlus 2 has the best reviews. However, what if we want to see a report on which model has the most positive and negative reviews? For that we will use the bing lexicon.
conclusion_bing <- reviews %>%
filter(str_detect(Segment, "Conclusion")) %>%
unnest_tokens("Word", "Text") %>%
anti_join(stop_words, by = c("Word" = "word")) %>%
inner_join(get_sentiments("bing"), by = c("Word" = "word"))
conclusion_bing
## # A tibble: 189 × 4
## Model Segment Word sentiment
## <chr> <chr> <chr> <chr>
## 1 OnePlus 1 Cameras and Conclusions led positive
## 2 OnePlus 1 Cameras and Conclusions distortion negative
## 3 OnePlus 1 Cameras and Conclusions miss negative
## 4 OnePlus 1 Cameras and Conclusions dynamic positive
## 5 OnePlus 1 Cameras and Conclusions distortion negative
## 6 OnePlus 1 Cameras and Conclusions warped negative
## 7 OnePlus 1 Cameras and Conclusions unnatural negative
## 8 OnePlus 1 Cameras and Conclusions admirable positive
## 9 OnePlus 1 Cameras and Conclusions soft positive
## 10 OnePlus 1 Cameras and Conclusions prefer positive
## # … with 179 more rows
Now we can proceed with the same steps, just add the sentiment to the grouping.
conclusion_bing %>%
group_by(Model, sentiment) %>%
count() %>%
ungroup() %>%
mutate(Model = reorder(Model, n)) %>%
ggplot(mapping = aes(x = Model, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(x = NULL, y = "Negative vs positive sentiment / Model") +
facet_wrap(~ sentiment, ncol = 2)
For example, the OnePlus 6T and the OnePlus 7 (for China) have no negative reviews, but they also have only a few positive things said about them. This seems to be reflected in their placement in the previous graph as well.
Both these approaches have their advantages and disadvantages and in practice you will most likely use a combination of both, not just one. It is really useful to view a problem from multiple angles.
However, please note that the lexicons we have used here are applied to just one word at a time, and that can miss the sentiment of a phrase (e.g. not good is a negative phrase, but the lexicon sees not as neutral and good as positive, so overall it scores the phrase as positive). To avoid situations like this we can use pairings of words and check for these cases.
As such, negation is not handled in this example.
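A minimal sketch of that pairing idea, using a tiny made-up lexicon (not AFINN or bing): flip a word's score whenever the word immediately before it is a negator.

```r
# Hypothetical mini-lexicon and negator list, for illustration only
lexicon  <- c(good = 3, bad = -3, great = 4)
negators <- c("not", "no", "never")

score_with_negation <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  total <- 0
  for (i in seq_along(words)) {
    s <- lexicon[words[i]]
    if (!is.na(s)) {
      # flip the score if the previous word is a negator
      if (i > 1 && words[i - 1] %in% negators) s <- -s
      total <- total + s
    }
  }
  unname(total)
}

score_with_negation("the camera is not good")  # -3, instead of +3 for "good" alone
```

A word-at-a-time lexicon would score this sentence +3; the bigram check correctly turns it negative. Real implementations (e.g. tidytext's bigram approach in Text Mining with R) work on the same principle.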
With that in mind, we look to employ our final method - wordclouds.
Wordclouds are a different approach to presenting the data. I personally find them very useful when you are trying to communicate the prevalence of a word in a text or speech. They play much the same role as a pie chart, but display the data in a more user-friendly way: a wordcloud lets you see how frequent a word is at a glance, without having to check and re-check a legend dozens of times.
With that said, let’s check our wordcloud. It should show the same
data as the first graph, just in a different display style, so I will
use the same data set reviews_tidy. For this we will use
the wordcloud package.
library(wordcloud)
reviews_tidy %>%
count(Word) %>%
with(wordcloud(Word, n, max.words = 100))
As you can see, the results are similar to the first analysis: the more frequent a word, the larger the font. However, with this type of graph we can include many more items; here we have included 100 words, as opposed to 15 in the first graph.
Let’s see how we can use a wordcloud for sentiment analysis.
library(reshape2)
reviews_tidy %>%
inner_join(get_sentiments("bing"), by = c("Word" = "word")) %>%
count(Word, sentiment, sort = TRUE) %>%
acast(Word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#202121", "#797C80"),
max.words = 50)
This is a very quick and useful way to show which elements influence the sentiment for your product the most and make decisions based on it.
We can clearly see that the words that most influence the negative scores are noise, expensive and loud, while the ones that most influence the positive reviews are excellent, fast and smooth.