Abstract
A tutorial introducing text mining methods.
This tutorial provides a foundational introduction to text mining and, more generally, natural language processing. We cover four key topics: text mining basics, sentiment analysis, bag of words, and TF-IDF. The series looks at a selection of methods for examining text data, manipulating it into different forms, and extracting insight from it. In this tutorial we shall mainly be using the tidytext package in R. A summary of each part is given below:
PART 1: outlines the tidy text format and the
unnest_tokens() function. We also introduce
\(n\)-grams. PART 2: shows how to
perform sentiment analysis on a tidy text data set.
The work in this tutorial is largely based on the work of Ian Durbach; his profile is found here. In fact, much of it follows the Data Science for Industry course from the University of Cape Town. Other great resources used for this tutorial are discussed below.
One of the resources is Chapter 12 of R4DS, which covers tidy data and the tidyr package.
We shall also be using Text Mining with R by Julia Silge and David Robinson. I highly recommend this book, as their approach is to transform the text into a tidy format that allows you to easily analyse and visualize the results using graphs. This particular tutorial follows chapters 1 to 4 of the book.
Finally, we shall also be making use of Chapter 20 of Data Science from Scratch, which covers natural language processing; the text generation section borrows from that chapter, though our use of this resource here is limited.
The document is for the most part very applied in nature, and doesn’t
assume much beyond familiarity with the R statistical computing
environment. For programming purposes, it would be useful if you are
familiar with the tidyverse, or at least dplyr
specifically.
It must be stressed that this is only a starting point, a hopefully fun foray into the world of text, not a definitive statement of how you should analyze text. In fact, some of the methods demonstrated would likely be too rudimentary for most goals.
In the first part of this notebook, we shall cover:
tidy data principles (using the pivot_longer() and pivot_wider() functions from the tidyr package)
tidying text data (using unnest_tokens())
extracting summaries from text data
\(n\)-grams
We shall start this tutorial by looking at the "tidy" data format - a
powerful way to make handling data easier and more effective. We shall
also build a simple text generator. We then move on to tokens and the
application of the unnest_tokens() function, and finally to uni-grams,
bi-grams and \(n\)-grams in general. An \(n\)-gram is a sequence of n
words in a text, and can be used in tokenizing.
Dealing with text has been of interest for some time, but it has risen to become one of the hottest topics in data science. Many individuals within the community who have had rigorous exposure to applied (or theoretical) statistics are trained to handle tabular or rectangular data that is mostly numeric, but much of the data proliferating today is unstructured and text-heavy. Many of us who work in analytical fields are not trained in even simple interpretation of natural language. Ultimately, working with text offers a plethora of unexplored insights and opportunities to derive value.
First load the required packages for this notebook.
library(tidyverse)
library(tidytext)
library(stringr)
library(lubridate)
library(ggpubr)
library(wordcloud)
options(repr.plot.width=4, repr.plot.height=3) # set plot size in the notebook
In this and the following parts of this tutorial, we’ll be using a
data set containing all of Donald Trump’s tweets. An archive of his
tweets was maintained here. These
can be downloaded as zipped JSON files. This repository only contains
tweets up to the end of 2018 and has at the time of writing not been
updated in several months. The available data have already been put into
an .RData file, but should the repository ever be updated
again, you can download and unzip the
condensed_20xx.json.zip files and use the code below.
library(jsonlite)
tweets <- data.frame()
for(i in 2009:2018){
x <- fromJSON(txt=paste0('data/trump_tweets_all/condensed_',i,'.json'), simplifyDataFrame = T) %>% map_df(rev)
tweets <- rbind.data.frame(tweets, rev(x))
}
rm(x)
save(tweets,file='data/trump-tweets.RData')
The data consists of a single data frame called
tweets. Having loaded it, we examine the contents:
load('trump-tweets.RData')
str(tweets)
## tibble [36,307 × 8] (S3: tbl_df/tbl/data.frame)
## $ is_retweet : logi [1:36307] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ favorite_count : int [1:36307] 202 3 2 27 1950 13 10 6 2 5 ...
## $ in_reply_to_user_id_str: chr [1:36307] NA NA NA NA ...
## $ retweet_count : int [1:36307] 253 2 3 8 1421 10 11 3 1 3 ...
## $ created_at : chr [1:36307] "Mon May 04 18:54:25 +0000 2009" "Tue May 05 01:00:10 +0000 2009" "Fri May 08 13:38:08 +0000 2009" "Fri May 08 20:40:15 +0000 2009" ...
## $ text : chr [1:36307] "Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!" "Donald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Th"| __truncated__ "Donald Trump reads Top Ten Financial Tips on Late Show with David Letterman: http://tinyurl.com/ooafwn - Very funny!" "New Blog Post: Celebrity Apprentice Finale and Lessons Learned Along the Way: http://tinyurl.com/qlux5e" ...
## $ id_str : chr [1:36307] "1698308935" "1701461182" "1737479987" "1741160716" ...
## $ source : chr [1:36307] "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" ...
We’ll start by turning the data frame into a tibble, and turning the
date into a format that will be easier to work with later on. We use the
parse_datetime() function from the
readr package. Note that we do not look into
processing dates and times in detail in this tutorial, but if you are
interested the relevant chapter of R4DS is here. An example
of the parsed date is shown below.
# turn into a tibble
tweets <- as_tibble(tweets)
# parse the date
tweets <- tweets %>% mutate(date = parse_datetime(str_sub(tweets$created_at,5,30), '%b %d %H:%M:%S %z %Y'))
tweets$date[1]
## [1] "2009-05-04 18:54:25 UTC"
Once the dates and times have been appropriately parsed we can perform various operations on them. For example, below we work out the earliest and latest tweets in the data set, and the duration of time covered by the data set (the difference of two datetimes is a difftime, here measured in days).
paste("Earliest tweet date:",min(tweets$date))
## [1] "Earliest tweet date: 2009-05-04 18:54:25"
paste("Latest tweet date:",max(tweets$date))
## [1] "Latest tweet date: 2018-12-31 23:53:06"
paste("Duration of time covered in this data set is", round(max(tweets$date) - min(tweets$date),2))
## [1] "Duration of time covered in this data set is 3528.21"
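As an aside, if you want the gap in a specific unit you can ask for it explicitly with difftime(). A minimal sketch, using the earliest and latest timestamps computed above:

```r
t1 <- as.POSIXct('2009-05-04 18:54:25', tz = 'UTC')
t2 <- as.POSIXct('2018-12-31 23:53:06', tz = 'UTC')
difftime(t2, t1, units = 'days')   # the duration in days (~3528.21)
difftime(t2, t1, units = 'weeks')  # the same gap expressed in weeks
```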
How many months are covered in this period?
n_months <- interval(min(tweets$date), max(tweets$date)) %/% months(1) + 1 #Add 1 to determine how MANY months were represented
n_months
## [1] 116
Let’s look at a few tweets. The slice_sample() function randomly selects \(n\) rows (here \(n=3\)).
set.seed(5073)
sample_tweets <- tweets %>% slice_sample(n = 3) %>% select(date,text)
sample_tweets
## # A tibble: 3 × 2
## date text
## <dttm> <chr>
## 1 2017-01-27 23:46:22 "I promise that our administration will ALWAYS have your …
## 2 2015-01-05 03:49:46 "\"@NicoleAMarin: You're Fired! It's music to my ears whe…
## 3 2012-09-28 18:00:42 "\"Trump buys mansion adjacent to family winery\" http://…
We start by plotting the number of tweets Trump has made over time. Retweets are shown in blue.
ggplot(tweets, aes(x = date, fill = is_retweet)) +
geom_histogram(position = 'identity', bins = n_months, show.legend = FALSE)
Most of the tweets are posted by Trump himself, and very few are
retweets. Given that Trump only started retweeting from 2016, it begs
the question: when did retweeting start? Was it even available before
2016? On November 05, 2009 Twitter started a limited roll-out of the
‘retweet’ feature to its users, so Trump simply chose not to
retweet prior to 2016. Even in the most recent data, Trump more
often than not tweets himself instead of retweeting.
Looking back at the sample tweets we had, shown below:
sample_tweets[,2]
## # A tibble: 3 × 1
## text
## <chr>
## 1 "I promise that our administration will ALWAYS have your back. We will ALWAYS…
## 2 "\"@NicoleAMarin: You're Fired! It's music to my ears when it comes from @rea…
## 3 "\"Trump buys mansion adjacent to family winery\" http://t.co/sTJXhgbK via @t…
We see from the few example tweets above that tweets have a
particular format, but that the format is quite hard to pin down. There
are special characters like @ and # that mean
something, some tweets have links while others do not, some are a single
sentence and others are several, some mention people by name, and so on.
This begs the question: how are we going to structure this data in a
meaningful way? To answer it, we’re going to go back a bit
and talk about “tidy” data.
Tidy data is basically just a way of consistently organizing your data that often makes subsequent analysis easier, particularly if you are using tidyverse packages. Getting your data into this format requires some upfront work, but that work pays off in the long term. Tidy data has three requirements:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
This only really makes sense once you work out what the variables and observations are. Of course you can’t define a variable as “what’s in a column”, or all data would be tidy!
In practice, the two main problems are that:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
In general, to fix these problems, you’ll need the two most important
functions in tidyr: pivot_longer() and pivot_wider().
For example, suppose we have the following data frame, containing movie ratings for 3 users and 4 movies:
untidy_ratings <- tribble(
~user_id, ~age, ~city, ~movie1, ~movie2, ~movie3,
1, 49, 'Cpt', 5, NA, NA,
2, 20, 'Cpt', 3, 3, 1,
3, 30, 'Jhb', NA, 5, 1
)
untidy_ratings
## # A tibble: 3 × 6
## user_id age city movie1 movie2 movie3
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 49 Cpt 5 NA NA
## 2 2 20 Cpt 3 3 1
## 3 3 30 Jhb NA 5 1
This data frame is not tidy because movie ratings are spread across multiple columns.
The “tidy” way to think about this data is that the variables are the
user_id, the user’s demographic variables
(age, city), the title of the
movie, and the rating given. At the moment some column
names (movie1, movie2, movie3)
are actually values of the title variable. This is a common
problem with untidy data. The observations in this case are user-movie
combinations. Note that this isn’t to say that “untidy” data is never
useful - in fact, data in the format above is often used when building
recommender systems. But there are advantages to working with tidy data,
especially if you are using packages from the tidyverse.
To make the data frame tidy, we use the pivot_longer()
function from the tidyr package, which provides a
number of functions for getting data into (and out of) tidy format.
Note: previously the gather() function
would have been used for this purpose, but it has been “retired”.
tidy_ratings <- untidy_ratings %>% pivot_longer(c(movie1, movie2, movie3), names_to = 'title', values_to = 'rating')
tidy_ratings %>% head(5)
## # A tibble: 5 × 5
## user_id age city title rating
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 49 Cpt movie1 5
## 2 1 49 Cpt movie2 NA
## 3 1 49 Cpt movie3 NA
## 4 2 20 Cpt movie1 3
## 5 2 20 Cpt movie2 3
This data frame shows all user-movie combinations. Those observations
with no rating (because the user has not seen the movie) are given an
NA. We say that missing values are explicitly
represented. Another way of representing missing values is to omit the
corresponding observation from the data frame. This is an
implicit representation of a missing value. To use the implicit
representation we just set values_drop_na = TRUE when using
pivot_longer().
tidy_ratings_imp <- untidy_ratings %>% pivot_longer(c(movie1, movie2, movie3), names_to = 'title', values_to = 'rating', values_drop_na = T)
tidy_ratings_imp
## # A tibble: 6 × 5
## user_id age city title rating
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 49 Cpt movie1 5
## 2 2 20 Cpt movie1 3
## 3 2 20 Cpt movie2 3
## 4 2 20 Cpt movie3 1
## 5 3 30 Jhb movie2 5
## 6 3 30 Jhb movie3 1
And we can get back to our original data frame with missing values (NA) by using the complete() function.
tidy_ratings_imp %>% complete(nesting(user_id, age, city), title) %>% head(5)
## # A tibble: 5 × 5
## user_id age city title rating
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 49 Cpt movie1 5
## 2 1 49 Cpt movie2 NA
## 3 1 49 Cpt movie3 NA
## 4 2 20 Cpt movie1 3
## 5 2 20 Cpt movie2 3
We can move in the opposite direction (spreading a variable across
multiple columns) using the pivot_wider() function.
untidy_ratings <- tidy_ratings %>% pivot_wider(names_from = 'title', values_from = 'rating')
untidy_ratings
## # A tibble: 3 × 6
## user_id age city movie1 movie2 movie3
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 49 Cpt 5 NA NA
## 2 2 20 Cpt 3 3 1
## 3 3 30 Jhb NA 5 1
Note that it’s not always the case that a “long” data format is tidy and a “wide” format is not. For example, the data frame below is tidy,
# A tibble: 6 x 4
town month avgtemp rainfall
<chr> <chr> <dbl> <dbl>
1 A Jan 24 12
2 B Jan 27 10
3 C Jan 30 16
4 A Jun 14 22
5 B Jun 20 62
6 C Jun 5 16
but it would not be tidy if we reshaped the data into a “longer” format:
# A tibble: 12 x 4
town month weather value
<chr> <chr> <chr> <dbl>
1 A Jan avgtemp 24
2 B Jan avgtemp 27
3 C Jan avgtemp 30
4 A Jun avgtemp 14
5 B Jun avgtemp 20
6 C Jun avgtemp 5
7 A Jan rainfall 12
8 B Jan rainfall 10
9 C Jan rainfall 16
10 A Jun rainfall 22
11 B Jun rainfall 62
12 C Jun rainfall 16
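For reference, this reshape can itself be done with pivot_longer(), and undone with pivot_wider(). A minimal sketch, with the weather tibble constructed inline for illustration using the values from the tidy table above:

```r
library(dplyr)
library(tidyr)
library(tibble)

weather <- tribble(
  ~town, ~month, ~avgtemp, ~rainfall,
  'A', 'Jan', 24, 12,
  'B', 'Jan', 27, 10,
  'C', 'Jan', 30, 16,
  'A', 'Jun', 14, 22,
  'B', 'Jun', 20, 62,
  'C', 'Jun',  5, 16
)

# stack avgtemp and rainfall into a single 'value' column,
# with a 'weather' column recording which variable each value came from
weather_long <- weather %>%
  pivot_longer(c(avgtemp, rainfall), names_to = 'weather', values_to = 'value')

# pivot_wider() reverses the operation
weather_wide <- weather_long %>%
  pivot_wider(names_from = 'weather', values_from = 'value')
```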
There are several more examples in Chapter 12 of R4DS, which covers tidy data and the tidyr package. It also covers the situation where one column contains two variables.
Ultimately, note that we cannot say in general that data in long format is tidy and wide format is untidy, per se, as we have seen from the above example. It often depends on your use case and the data you need.
Tidy text is defined by the authors of tidytext as “a table with one-token-per-row” (see Chapter 1 of TMR).
A token is a whatever unit of text is meaningful for your analysis: it could be a word, a word pair, a sentence, a paragraph, a chapter, whatever.
That means that the process of getting text data tidy is largely a matter of deciding what your token is, and then splitting the text up into those tokens.
Tokenization is the process of splitting text up
into the units that we are interested in analyzing. The
unnest_tokens() function performs tokenization by splitting
text up into the required tokens, and creating a new data frame with one
token per row i.e. tidy text data.
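As a minimal sketch of what unnest_tokens() does, consider a toy two-row data frame (the text here is made up for illustration):

```r
library(tibble)
library(tidytext)

toy <- tibble(doc = c(1, 2),
              text = c('Tidy text is one token per row.',
                       'Tokens can be words or sentences.'))

# one word per row; by default unnest_tokens lowercases and strips punctuation
unnest_tokens(toy, word, text, token = 'words')
```

Each output row keeps the other columns (doc here), which is what lets us later summarize tokens by tweet, by date, and so on.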
Referring back to the example we introduced: suppose we want to
analyze the individual words Trump uses in his tweets. We do this with
unnest_tokens(tweets, word, text, token = 'words'). Note
the arguments passed to unnest_tokens():
the input data frame (tweets)
the name of the output column (word)
the input column containing the text (text, i.e. the tweets are in tweets$text)
the tokenization rule (token = 'words')
So below when we implement tokenization with
unnest_tokens(), the first argument is the tibble; the second is the
output (what column we are creating); the third is the column in the
tibble we are creating it from; and the last is the tokenization rule.
unnest_tokens(sample_tweets, word, text, token = 'words') %>% head(6)
## # A tibble: 6 × 2
## date word
## <dttm> <chr>
## 1 2017-01-27 23:46:22 i
## 2 2017-01-27 23:46:22 promise
## 3 2017-01-27 23:46:22 that
## 4 2017-01-27 23:46:22 our
## 5 2017-01-27 23:46:22 administration
## 6 2017-01-27 23:46:22 will
We note that some tokens we don’t want come up as words, like “https” and parts of links. Let’s see what happens if we tokenize by sentences:
unnest_tokens(sample_tweets, sentences, text, token = 'sentences') %>% select(sentences, everything()) %>% head(5)
## # A tibble: 5 × 2
## sentences date
## <chr> <dttm>
## 1 "i promise that our administration will always have your … 2017-01-27 23:46:22
## 2 "we will always be with you!" 2017-01-27 23:46:22
## 3 "https://t.co/d0aowhoh4x" 2017-01-27 23:46:22
## 4 "\"@nicoleamarin: you're fired!" 2015-01-05 03:49:46
## 5 "it's music to my ears when it comes from @realdonaldtrum… 2015-01-05 03:49:46
A really nice feature of tidytext is that you can
tokenize by regular expressions. This gives you a lot of flexibility in
deciding what you want a token to constitute. For example, for analyzing
tweet data we can explicitly include symbols like @ and
# that mean something, and exclude symbols like
? and ! that don’t add anything (unless we
want to include them, in which case we can!)
unnest_tokens(sample_tweets, word, text, token = 'regex', pattern = "[^\\w_#@']") %>% head(6)
## # A tibble: 6 × 2
## date word
## <dttm> <chr>
## 1 2017-01-27 23:46:22 i
## 2 2017-01-27 23:46:22 promise
## 3 2017-01-27 23:46:22 that
## 4 2017-01-27 23:46:22 our
## 5 2017-01-27 23:46:22 administration
## 6 2017-01-27 23:46:22 will
We’re now in a position to transform the full set of tweets into tidy text format. We’ll tokenize with the regular expression we used in the last example above.
unnest_reg <- "[^\\w_#@']"
tidy_tweets <- tweets %>%
mutate(text, text = str_replace_all(text, "’", "'")) %>% #replace curly apostrophe with straight
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>%
select(date, word, favorite_count)
Let’s plot the most commonly used words:
tidy_tweets %>%
count(word, sort = TRUE) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() + coord_flip() + xlab('')
Not very useful! What’s happening here, unsurprisingly, is that common words like “the”, “to”, etc are coming up most often. We can tell tidytext to ignore these words (which are called “stop words”). Let’s have a look at a sample of stop words contained in the dictionary used by tidytext:
sort(sample(stop_words$word,15))
## [1] "different" "for" "hadn't" "here's" "i'll"
## [6] "in" "interested" "isn't" "oh" "ordering"
## [11] "seems" "tends" "ways" "wherever" "you"
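An alternative to filtering with !word %in% stop_words$word, and the idiom used throughout the tidytext book, is an anti_join() against the stop word lexicon. A quick sketch on a toy token data frame (the words are made up for illustration):

```r
library(dplyr)
library(tibble)
library(tidytext)

toy_tokens <- tibble(word = c('the', 'fake', 'is', 'hillary'))

# anti_join keeps only the rows whose word does NOT appear in stop_words
toy_tokens %>% anti_join(stop_words, by = 'word')
```

Here “the” and “is” are dropped as stop words, leaving “fake” and “hillary”.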
There are also some words like “http” and “www” that are not stop
words but that we don’t want. We need to remove these “manually” using
str_replace_all() with a regular expression of our choice.
The regular expression below is a bit more complex than what we’ve used
before.
We look at a few examples below. Play around with the input tweets and the regular expression to get a good idea of what everything does.
# pattern that we want to remove (replace with nothing)
replace_reg <- '(https?:.*?(\\s|.$))|(www.*?(\\s|.$))|&|<|>'
# tweet with a www. link
tweets$text[145]
str_replace_all(tweets$text[145], replace_reg, '')
# tweet with an http link at end of tweet
tweets$text[194]
str_replace_all(tweets$text[194], replace_reg, '')
# tweet with multiple &
tweets$text[36185]
str_replace_all(tweets$text[36185], replace_reg, '')
The basic idea is:
https?: finds http: or https:.
.* finds the longest match, which we don’t want (why?), so we use .*?, which finds the shortest match.
(\\s|.$) says “go until you hit a space or the end of the string”.
The same pattern is applied to links starting with www.
&amp;, &lt;, &gt; sometimes appear in the tweets; these are HTML entities for &, <, > respectively.
Now we can put it all together – first remove the retweets (we only
want to look at Trump’s own tweets), then remove the non-words,
unnest into words, and remove stop words.
# pattern that we want to remove (replace with nothing)
replace_reg <- '(https?:.*?(\\s|.$))|(www.*?(\\s|.$))|&|<|>'
unnest_reg <- "[^\\w_#@']"
tidy_tweets <- tweets %>%
mutate(text, text = str_replace_all(text, "’", "'")) %>% #replace curly apostrophe with straight
filter(is_retweet == FALSE) %>% #remove retweets
mutate(text = str_replace_all(text, replace_reg, '')) %>% #remove with a reg exp (http, www, etc)
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>% #unnest tokens
filter(!word %in% stop_words$word, str_detect(word, '[a-z]')) %>% #remove stop words and tokens without letters
select(date,word,favorite_count)
We again plot the most commonly used tokens in our newly-cleaned data frame.
tidy_tweets %>%
count(word, sort = TRUE) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word, n), n)) + geom_col() + coord_flip() + xlab('')
It turns out Trump likes tweeting about himself, mostly.
We can also see whether being president has changed the words he most
commonly uses. To do this we first create a new variable: a binary
indicator of whether a tweet was made before or after Trump became
president. We do this by comparing the date of the tweet to the date of
the US election (8th November 2016). Note that once we have the date in
a recognized format like ymd or dmy, this
comparison is trivial.
tidy_tweets <- tidy_tweets %>% mutate(is_potus = (date > ymd(20161108)))
options(repr.plot.width=6, repr.plot.height=5) # make plot size bit bigger for next plots
tidy_tweets %>%
group_by(is_potus) %>%
count(word, sort = TRUE) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word, n), n, fill = is_potus)) + geom_col() + coord_flip() + xlab('')
Interesting insights are brought out here. He only started mentioning
words like fake (news), democrats and military, for example, when he
became president. Conversely, he mentioned Obama (or at least that
name) quite often before he was president, but rarely afterwards.
The plot above is a bit unsatisfying - there are obviously a lot more tweets pre-presidency, and that makes it difficult to see what is happening in the post-presidency frequencies. Below we transform the absolute frequencies into relative ones and plot those.
total_tweets <- tidy_tweets %>%
group_by(is_potus) %>%
summarise(total = n())
tidy_tweets %>%
group_by(is_potus) %>%
count(word, sort = TRUE) %>% #count the number of times word used
left_join(total_tweets, by = 'is_potus') %>% #add the total number of tweets made (pre- or post-potus)
mutate(freq = n/total) %>% #add relative frequencies
filter(rank(desc(freq)) < 20) %>%
ggplot(aes(reorder(word, freq), freq, fill = is_potus)) +
geom_col() +
coord_flip() +
xlab('') +
facet_grid(.~is_potus)
Below we show a wordcloud of Trump’s tweets after he became president. Wordclouds are not particularly informative – they just plot words proportional to their frequency of use and position them in an attractive way. This uses the wordcloud package.
tidy_tweets %>%
filter(is_potus == TRUE) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
So far we’ve considered words as individual units, and considered their relationships to sentiments or to documents. However, many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately, or which tend to co-occur within the same documents.
Now we introduce and explore some of the methods tidytext offers for calculating and visualizing relationships between words in your text dataset. This includes the token = “ngrams” argument, which tokenizes by pairs of adjacent words rather than by individual ones. More information can be found here.
An n-gram is a sequence of n words in a text. Uni-grams are single words, bi-grams are pairs of adjacent words, tri-grams are sequences of three words, and so on.
The unnest_tokens() function allows you to easily
extract n-grams using the “n-grams” token.
# bigrams
unnest_tokens(sample_tweets, bigram, text, token = 'ngrams', n = 2) %>% head(6)
## # A tibble: 6 × 2
## date bigram
## <dttm> <chr>
## 1 2017-01-27 23:46:22 i promise
## 2 2017-01-27 23:46:22 promise that
## 3 2017-01-27 23:46:22 that our
## 4 2017-01-27 23:46:22 our administration
## 5 2017-01-27 23:46:22 administration will
## 6 2017-01-27 23:46:22 will always
Let’s try tokenizing three words at a time (tri-grams).
# trigrams
unnest_tokens(sample_tweets, trigram, text, token = 'ngrams', n = 3) %>% head(6)
## # A tibble: 6 × 2
## date trigram
## <dttm> <chr>
## 1 2017-01-27 23:46:22 i promise that
## 2 2017-01-27 23:46:22 promise that our
## 3 2017-01-27 23:46:22 that our administration
## 4 2017-01-27 23:46:22 our administration will
## 5 2017-01-27 23:46:22 administration will always
## 6 2017-01-27 23:46:22 will always have
We extract the full set of bi-grams below, and do the same cleaning
up we did for unigrams earlier. Removing stop words is a bit trickier
with bi-grams. We need to separate each bi-gram into its constituent
words, remove the stop words, and then put the bi-grams back together
again. Both separate() and unite() are
tidyr functions.
replace_reg <- '(https?:.*?(\\s|.$))|(www.*?(\\s|.$))|&|<|>'
# tokenization
tweet_bigrams <- tweets %>%
filter(is_retweet == FALSE) %>%
mutate(text = str_replace_all(text, replace_reg, '')) %>%
unnest_tokens(bigram, text, token = 'ngrams', n = 2)
# separate the bigrams
bigrams_separated <- tweet_bigrams %>%
separate(bigram, c('word1', 'word2'), sep = ' ')
# remove stop words
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word)
# join up the bigrams again
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = ' ')
We can now see what the most common bi-grams are. Note that these are very different from the most common words. That is, they definitely provide different and useful information over and above what the unigrams did.
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE) %>%
filter(rank(desc(n)) <= 10) %>%
na.omit() #if a tweet contains just one word, then the bigrams will return NA
bigram_counts %>% head(5)
## # A tibble: 5 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 donald trump 1105
## 2 fake news 339
## 3 crooked hillary 270
## 4 hillary clinton 262
## 5 white house 260
These are the bi-grams (pairs of consecutively used words) that appeared most often after removing all of the stop words and other non-word tokens.
This one-bigram-per-row format is helpful for exploratory analyses of the text. We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or “graph.” Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables:
from: the node an edge is coming from
to: the node an edge is going towards
weight: a numeric value associated with each edge
The igraph package has many powerful functions for
manipulating and analyzing networks. One way to create an igraph object
from tidy data is the graph_from_data_frame() function,
which takes a data frame of edges with columns for “from”, “to”, and
edge attributes (in this case n):
library(igraph)
# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
filter(n > 20) %>%
graph_from_data_frame()
bigram_graph
## IGRAPH 01ac3e4 DN-- 16 9 --
## + attr: name (v/c), n (e/n)
## + edges from 01ac3e4 (vertex names):
## [1] donald ->trump fake ->news crooked ->hillary
## [4] hillary ->clinton white ->house celebrity->apprentice
## [7] president->obama trump ->tower witch ->hunt
We can now use the ggraph package which has
visualization methods for graph objects. This package implements these
visualizations in terms of the grammar of graphics, which we are already
familiar with from ggplot2.
We can convert an igraph object into a ggraph with the
ggraph function, after which we add layers to it, much like
layers are added in ggplot2. For example, for a basic graph
we need to add three layers: nodes, edges, and text.
library(ggraph)
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
The above shows common bi-grams in Donald Trump’s tweets: those that occurred more than 20 times and where neither word was a stop word. We can see some details of the text structure. For example, tower and donald both connect to a common node, trump, which makes sense. Other than that there is not much to dive into.
Note that we can make some visual improvements to the figure above. We
add the edge_alpha aesthetic to the link layer to make
links transparent based on how common or rare the bi-gram is. We add
directionality with an arrow, constructed using
grid::arrow(), including an end_cap option that tells the
arrow to end before touching the node. We tinker with the options to the
node layer to make the nodes more attractive (larger, blue points).
Finally, we add a theme that’s useful for plotting networks,
theme_void().
set.seed(2020)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
It may take some experimentation with ggraph to get your networks into a presentable format like this, but the network structure is a useful and flexible way to visualize relational tidy data.
Note that this is a visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word.
More useful methods and techniques for \(n\)-grams are found in the Text Mining with R textbook, here.
One area of natural language processing is (natural language) generation, roughly speaking an attempt to generate realistic looking and sounding text and speech from some data or knowledge base. In this section we’ll build a model for generating Trump-like tweets. We do this by looking at sequences of words (or more generally n-grams) that Trump uses.
We begin by generating Trump-like sentences. We do this by: starting at the end of a sentence (a full stop), repeatedly sampling a word that can follow the current word (based on the word pairs observed in Trump’s tweets), and ending the sentence when we sample a full stop.
We still need to turn sentences into tweets - a simple way is just to generate sentences until you are out of space (140 characters).
Previously, full stops were included as separating characters, i.e. we
split a string wherever we encountered a full stop (see
unnest_reg above, and compare below). Now we want to treat
a full stop as equivalent to a word: since sentences end with a full
stop, the full stop becomes something that can follow a previous word.
We thus need to create a new tidy tweets data frame that does this.
We also do some additional cleaning. For the word “don’t” we drop the apostrophe. Also, “Donald J. Trump” comes up often, so we remove that full stop since we don’t want it to interfere with treating full stops as words.
unnest_reg <- "[^\\w_#@'\\.]" #note adding a fullstop to the reg exp used before
tidy_tweets_wstop <- tweets %>%
filter(is_retweet == FALSE) %>%
mutate(text = str_replace_all(text, "[Dd]on't", 'dont')) %>% #some additional cleaning
mutate(text = str_replace_all(text, '(j\\.)|(J\\.)', 'j')) %>% #some additional cleaning
mutate(text = str_replace_all(text, replace_reg, '')) %>%
mutate(text = str_replace_all(text, '\\.', ' \\.')) %>% #add a space before fullstop so counted as own word
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>%
select(id_str, date, word) %>%
group_by(id_str) %>% #group words by tweet
mutate(next_word = lead(word)) #the "lead" and "lag" operators can be very useful!
Let’s have a look at the data frame we just constructed.
head(tidy_tweets_wstop, 6)
## # A tibble: 6 × 4
## # Groups: id_str [1]
## id_str date word next_word
## <chr> <dttm> <chr> <chr>
## 1 1698308935 2009-05-04 18:54:25 be sure
## 2 1698308935 2009-05-04 18:54:25 sure to
## 3 1698308935 2009-05-04 18:54:25 to tune
## 4 1698308935 2009-05-04 18:54:25 tune in
## 5 1698308935 2009-05-04 18:54:25 in and
## 6 1698308935 2009-05-04 18:54:25 and watch
We now count the number of times each word pair (word followed by next_word) occurs. This is the same as the bi-gram frequency count we did above, except that we have now included full stops.
transitions <- tidy_tweets_wstop %>%
group_by(word,next_word) %>%
count() %>%
arrange(desc(n)) %>%
ungroup() # remember to ungroup else later steps are slow!
transitions
## # A tibble: 234,662 × 3
## word next_word n
## <chr> <chr> <int>
## 1 . <NA> 10344
## 2 . . 4701
## 3 thank you 2239
## 4 of the 2214
## 5 will be 1880
## 6 in the 1624
## 7 is a 1260
## 8 a great 1163
## 9 donald trump 1110
## 10 . i 1063
## # … with 234,652 more rows
We see a lot of full stops followed by NA. And a lot of full stops followed by full stops.
The last full stop of every tweet is followed by an NA.
This causes problems later on, so we replace the NA with
another full stop.
transitions$next_word[is.na(transitions$next_word)] <- '.'
Finally, we unleash our uni-gram based Trump tweeter. This model samples the next word uniformly at random from the list of observed next words. It keeps adding words until the tweet exceeds 140 characters and a full stop is reached, so the generated tweets will run slightly over the 140-character limit.
# trump v1, unigram model, random sampling
set.seed(5073)
# start at the end of a sentence (so next word is a start word)
current <- '.'
result <- '.'
keep_going <- TRUE
while(keep_going == TRUE){
# get next word
next_word <- transitions %>%
filter(word == as.character(current)) %>%
slice_sample(n = 1) %>% # random sampling
select(next_word)
# combine with result so far
result <- str_c(result,' ', next_word)
current <- next_word
# does the current word appear in the 'word' column?
n_current <- sum(transitions$word == as.character(current))
# keep going if can look up current word and tweet is < 140 or current word is not .
keep_going <- ifelse(n_current == 0, FALSE,
ifelse(nchar(result) < 140, TRUE,
ifelse(str_detect(current,'\\.'), FALSE, TRUE)))
}
# show text generation
result
## [1] ". best cast or made #betterwithfriends . hosting let everyone i drove into any clue someone should thank steve . where will washington . being incorrect ill say isis would fire that garbage alcohol in millions restoring fiscal issues that we ran him ok but obstructionists and im pretty weak america join together our multi faceted transactions i walked past cycle ."
The generated tweets have traces of Trump but are more or less gibberish. We can try to improve the generation model by sampling each next word in proportion to its bi-gram frequency, rather than uniformly at random.
# trump v2, unigram model, sample using transition probabilities
set.seed(5073)
# start at the end of a sentence (so next word is a start word)
current <- '.'
result <- '.'
keep_going <- TRUE
while(keep_going == TRUE){
# get next word
next_word <- transitions %>%
filter(word == as.character(current)) %>%
slice_sample(n = 1, weight_by = n) %>% # proportional to count
select(next_word)
# combine with result so far
result <- str_c(result,' ', next_word)
current <- next_word
# does the current word appear in the 'word' column?
n_current <- sum(transitions$word == as.character(current))
# keep going if can look up current word and tweet is < 140 or current word is not .
keep_going <- ifelse(n_current == 0, FALSE,
ifelse(nchar(result) < 140, TRUE,
ifelse(str_detect(current,'\\.'), FALSE, TRUE)))
}
result
## [1] ". make a fundraiser tonight at trump make you forgot to issue on food stamp policies have not want to usa pageant in congress travelling on april s ability of a terrorist pan handle the public about gadhafi ."
It’s hard to tell, but from a few runs it looks a little better, though still not great.
Let’s see if we do any better if we look at transitions between bigrams rather than between words. As for the unigram model, we first need to add full stops, and then count how many times each transition (now between pairs of bigrams) occurs.
# we've already lagged unigrams with full stops before, use these to create lagged bigrams
bigrams_wstop <- tidy_tweets_wstop %>%
filter(!is.na(word) & !is.na(next_word)) %>% #don't want to unite with NAs
unite(bigram, word, next_word, sep = ' ') %>%
mutate(next_bigram = lead(bigram, 2))
bigrams_wstop
## # A tibble: 638,248 × 4
## # Groups: id_str [35,182]
## id_str date bigram next_bigram
## <chr> <dttm> <chr> <chr>
## 1 1698308935 2009-05-04 18:54:25 be sure to tune
## 2 1698308935 2009-05-04 18:54:25 sure to tune in
## 3 1698308935 2009-05-04 18:54:25 to tune in and
## 4 1698308935 2009-05-04 18:54:25 tune in and watch
## 5 1698308935 2009-05-04 18:54:25 in and watch donald
## 6 1698308935 2009-05-04 18:54:25 and watch donald trump
## 7 1698308935 2009-05-04 18:54:25 watch donald trump on
## 8 1698308935 2009-05-04 18:54:25 donald trump on late
## 9 1698308935 2009-05-04 18:54:25 trump on late night
## 10 1698308935 2009-05-04 18:54:25 on late night with
## # … with 638,238 more rows
We now calculate the frequency count for each bigram-to-next_bigram transition. We first remove any rows where either bigram or next_bigram is missing.
# transition matrix
bigram_transitions <- bigrams_wstop %>%
filter(!is.na(bigram) & !is.na(next_bigram)) %>%
group_by(bigram,next_bigram) %>%
count() %>%
arrange(desc(n)) %>%
ungroup() # remember to ungroup else later steps are slow!
bigram_transitions
## # A tibble: 497,313 × 3
## bigram next_bigram n
## <chr> <chr> <int>
## 1 . . . . 873
## 2 make america great again 456
## 3 the u .s . 440
## 4 art of the deal 151
## 5 the art of the 137
## 6 . think like a 109
## 7 . thank you . 103
## 8 please run for president 101
## 9 i will be interviewed 99
## 10 think like a champion 89
## # … with 497,303 more rows
In the bigram model, a full stop is one “word” in a bigram. So we can’t start with only a full stop, like we did before. The approach we’ll take is to randomly select one of the bigrams Trump has used.
set.seed(5073)
# extract all starting rows
start_bigrams <- bigrams_wstop %>%
group_by(id_str) %>%
slice_head(n = 1) %>% # takes the first row from each tweet
ungroup()
# choose one starting bigram
current <- start_bigrams %>%
slice_sample(n = 1) %>%
select(bigram) %>%
as.character()
current
## [1] "great job"
Finally we can put everything together to generate a tweet using the bigram model we just created.
# trump v3, bigram model, sample using transition probabilities
# start with starting bigram previously generated
result <- current
keep_going <- TRUE
while(keep_going == TRUE){
# get next bigram
next_bigram <- bigram_transitions %>%
filter(bigram == as.character(current)) %>%
slice_sample(n = 1, weight_by = n) %>%
select(next_bigram)
# combine with result so far
result <- str_c(result,' ', next_bigram)
current <- next_bigram
# does the current bigram appear in the bigram column?
n_current <- sum(bigram_transitions$bigram == as.character(current))
# keep going if can look up current bigram and tweet is < 140 or current bigram
# does not contain a .
keep_going <- ifelse(n_current == 0, FALSE,
ifelse(nchar(result) < 140, TRUE,
ifelse(str_detect(current,'\\.'), FALSE, TRUE)))
}
result
## [1] "great job on completing our new secretary of defense james mattis and general chief of staff john kelly who is tough on crime border military vets and the @senategop we are appointing high quality federal district . ."
Sentiment analysis is the study of the emotional content of a body of text. In this section we provide an introduction to sentiment analysis, covering sentiment lexicons, word-level and tweet-level sentiment, and the handling of negation.
We end the section by using all of these ideas to analyze the emotional content of Donald Trump’s tweets and examine how it has changed over time.
Note that Chapter 2 of Text Mining in R covers sentiment analysis, and negation is handled in Chapter 4. Many of the ideas and some of the code in this workbook are drawn from these chapters.
A common and intuitive approach to text is sentiment analysis. In a grand sense, we are interested in the emotional content of some text, e.g. posts on Facebook, tweets, or movie reviews. Most of the time, this is obvious when one reads it, but if you have hundreds of thousands or millions of strings to analyze, you’d like to be able to do so efficiently.
When we read text, as humans, we infer the emotional content from the words used in the text, and some more subtle cues involving how these words are put together. Sentiment analysis tries to do the same thing algorithmically.
One way of approaching the problem is to assess the sentiment of individual words, and then aggregate the sentiments of the words in a body of text in some way. For example, if we can classify whether each word is positive, negative, or neutral, we can count up the number of positive, negative, and neutral words in the document and define that as the sentiment of the document. This is just one way - a particularly simple way - of doing document-level sentiment analysis.
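As a minimal sketch of this counting idea, using a made-up three-word lexicon and base R only:

```r
# Hypothetical mini-lexicon mapping words to sentiments (for illustration only)
lexicon <- c(great = "positive", amazing = "positive", sad = "negative")
doc <- c("great", "amazing", "sad", "wall")     # "wall" is not in the lexicon
sent <- lexicon[doc]                            # look up each word (NA if absent)
net <- sum(sent == "positive", na.rm = TRUE) - sum(sent == "negative", na.rm = TRUE)
net  # 2 positives minus 1 negative = 1: a mildly positive "document"
```

The real lexicons used below work the same way, just at a much larger scale and joined on via dplyr rather than vector indexing.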
When assessing the sentiment or emotional content of individual words, we usually make use of existing sentiment dictionaries (or “lexicons”) that have already done this using some kind of manual classification.
Note: a token simply represents a unit of text. Put tokens together and you have a document; a collection of documents is a corpus.
First load the packages we need for this section:
library(tidyverse)
library(tidytext)
library(textdata)
library(stringr)
library(lubridate)
options(repr.plot.width=4, repr.plot.height=3) # set plot size in the notebook
We shall be using the Trump tweet data we used in the earlier part of this tutorial. Remember, this data contains the tweets from Trump. We need to get the data into tidy text format. These are the same operations we did in the previous section.
load('trump-tweets.RData')
# make data a tibble
tweets <- as_tibble(tweets)
# parse the date and add some date related variables
tweets <- tweets %>%
mutate(date = parse_datetime(str_sub(tweets$created_at, 5, 30), '%b %d %H:%M:%S %z %Y')) %>%
mutate(is_potus = (date > ymd(20161108))) %>%
mutate(month = make_date(year(date), month(date)))
# turn into tidy text
replace_reg <- '(http.*?(\\s|.$))|(www.*?(\\s|.$))|&amp;|&lt;|&gt;'
unnest_reg <- "[^\\w_#@']"
tidy_tweets <- tweets %>%
filter(is_retweet == FALSE) %>% #remove retweets
mutate(text = str_replace_all(text, replace_reg, '')) %>% #remove stuff we don't want like links
unnest_tokens(word, text, token = 'regex', pattern = unnest_reg) %>% #tokenize
filter(!word %in% stop_words$word, str_detect(word, '[A-Za-z]')) %>% #remove stop words
select(date, word, is_potus, favorite_count, id_str, month) #choose the variables we need
Our data for this part of the tutorial is now ready. Let’s have a look at it again before discussing lexicons.
tidy_tweets %>% head(5)
## # A tibble: 5 × 6
## date word is_potus favorite_count id_str month
## <dttm> <chr> <lgl> <int> <chr> <date>
## 1 2009-05-04 18:54:25 tune FALSE 202 1698308935 2009-05-01
## 2 2009-05-04 18:54:25 watch FALSE 202 1698308935 2009-05-01
## 3 2009-05-04 18:54:25 donald FALSE 202 1698308935 2009-05-01
## 4 2009-05-04 18:54:25 trump FALSE 202 1698308935 2009-05-01
## 5 2009-05-04 18:54:25 late FALSE 202 1698308935 2009-05-01
The gist is that we are dealing with a specific, pre-defined vocabulary, and any analysis will only be as good as the lexicon. The goal is usually to assign a sentiment score to a text, possibly an overall numeric score or a broadly positive or negative label. Other analyses may then build on these scores, for example models that predict sentiment.
The tidytext package (together with textdata) provides four existing sentiment lexicons or dictionaries. These describe the emotional content of individual words in different formats, and have been put together manually. We will only be considering three of these; the fourth, loughran, is intended for use with financial documents.
afinn <- get_sentiments('afinn')
bing <- get_sentiments('bing')
save(afinn, bing, file = "dsfi-lexicons.Rdata")
We can now have a look at each of the lexicons:
load("dsfi-lexicons.Rdata")
afinn %>% slice_sample(n = 5) %>% head(6)
## # A tibble: 5 × 2
## word value
## <chr> <dbl>
## 1 motivate 1
## 2 darkness -1
## 3 amazing 4
## 4 cruel -3
## 5 attracted 1
bing %>% slice_sample(n = 5) %>% head(6)
## # A tibble: 5 × 2
## word sentiment
## <chr> <chr>
## 1 godlike positive
## 2 aggrivation negative
## 3 incomprehensible negative
## 4 illuminating positive
## 5 saintly positive
Below we use the bing lexicon to add a new variable
indicating whether each word in our tidy_tweets data frame
is positive or negative. We use a left join here, which keeps
all the words in tidy_tweets.
Words appearing in our tweets but not in the bing lexicon
will appear as NA. We rename these “neutral”, but need to
be a bit careful here. No sentiment lexicon contains all words, so some
words that are actually positive or negative will be labelled
as NA and hence “neutral”. We can avoid this problem by
using an inner join rather than a left join, by filtering out neutral
words later on, or by just keeping in mind that “neutral” doesn’t really
mean “neutral”.
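The difference between the two joins can be seen on a toy example; the two-word mini-lexicon below is made up purely for illustration:

```r
library(dplyr)
words <- tibble(word = c("great", "disaster", "wall"))
mini_bing <- tibble(word      = c("great", "disaster"),   # hypothetical two-word lexicon
                    sentiment = c("positive", "negative"))
left_join(words, mini_bing, by = "word")   # 3 rows: "wall" kept, sentiment is NA
inner_join(words, mini_bing, by = "word")  # 2 rows: "wall" dropped entirely
```

With the left join, every tweet word survives and unmatched words get NA; with the inner join, unmatched words silently disappear, which changes any denominator you later compute.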
There’s one last issue: in the bing lexicon the word “trump” is positive, which will obviously skew the sentiment of Trump’s tweets, particularly bearing in mind he often tweets about himself! We manually recode the sentiment of this word to “neutral”.
tidy_tweets <- tidy_tweets %>%
left_join(bing) %>% #add sentiments (pos or neg)
select(word, sentiment, everything()) %>%
mutate(sentiment = ifelse(word == 'trump', NA, sentiment)) %>% #'trump' is a positive word in the bing lexicon!
mutate(sentiment = ifelse(is.na(sentiment), 'neutral', sentiment))
Let’s look at Trump’s 20 most common positive words:
tidy_tweets %>%
filter(sentiment == 'positive') %>%
count(word) %>%
arrange(desc(n)) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word,n),n)) + geom_col() + coord_flip() + xlab('')
And the 20 most common negative words:
tidy_tweets %>%
filter(sentiment == 'negative') %>%
count(word) %>%
arrange(desc(n)) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word,n),n)) + geom_col() + coord_flip() + xlab('')
Once we have attached sentiments to words in our data frame, we can analyze these in various ways. For example, we can examine trends in sentiment over time. Here we count the number of positive, negative and neutral words used each month and plot these. Because the neutral words dominate, it’s difficult to see any trends with them included. We therefore remove the neutral words before plotting.
sentiments_per_month <- tidy_tweets %>%
group_by(month, sentiment) %>%
summarize(n = n())
ggplot(filter(sentiments_per_month, sentiment != 'neutral'), aes(x = month, y = n, fill = sentiment)) +
geom_col()
It seems to be relatively balanced throughout, although the variation in the number of tweets made each month makes it difficult to say with any certainty which sentiments dominate over time. We can improve the visualization by plotting the proportion of all words tweeted in a month that were positive or negative. The plot shows the raw proportions as well as smoothed versions of these.
sentiments_per_month <- sentiments_per_month %>%
left_join(sentiments_per_month %>%
group_by(month) %>%
summarise(total = sum(n))) %>%
mutate(freq = n/total)
sentiments_per_month %>% filter(sentiment != 'neutral') %>%
ggplot(aes(x = month, y = freq, colour = sentiment)) +
geom_line() +
geom_smooth(aes(colour = sentiment))
We see that before 2010 his tweets had a positive sentiment overall; indeed, in proportional terms they remained predominantly positive all the way up to 2016. Once he became president this appeared to change: the gap between positive and negative sentiment narrowed over time, and in more recent years negative-sentiment words appear to be in the majority.
We can fit a simple linear model to check whether the proportion of negative words has increased over time. Strictly speaking a linear model is not appropriate, as the response is bounded between 0 and 1; you could try fitting e.g. a binomial GLM instead.
model <- lm(freq ~ month, data = subset(sentiments_per_month, sentiment == 'negative'))
summary(model)
##
## Call:
## lm(formula = freq ~ month, data = subset(sentiments_per_month,
## sentiment == "negative"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044031 -0.011884 -0.001153 0.013793 0.052094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.821e-01 3.326e-02 -5.475 3.00e-07 ***
## month 1.534e-05 2.045e-06 7.502 2.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02052 on 105 degrees of freedom
## Multiple R-squared: 0.349, Adjusted R-squared: 0.3428
## F-statistic: 56.28 on 1 and 105 DF, p-value: 2.118e-11
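A hedged sketch of the binomial GLM suggested above: since sentiments_per_month is built earlier in the document, we fabricate a small stand-in data frame with the same columns just to show the call (the numbers are invented).

```r
# Hypothetical stand-in for the negative rows of sentiments_per_month
neg <- data.frame(month = 1:12,
                  n     = c(5, 7, 6, 9, 11, 10, 14, 13, 15, 18, 17, 20), # negative words
                  total = rep(100, 12))                                  # all words
# Model the proportion negative directly: successes vs failures
fit <- glm(cbind(n, total - n) ~ month, family = binomial, data = neg)
coef(fit)["month"] > 0  # TRUE here: the proportion negative increases over time
```

On the real data you would replace neg with subset(sentiments_per_month, sentiment == 'negative'); the binomial family respects the 0-1 bound that the linear model ignores.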
So far we’ve looked at the sentiment of individual words, but how can we assess the sentiment of longer sequences of text, like bi-grams, sentences or entire tweets? One approach is to attach sentiments to each word in the longer sequence, and then add up the sentiments over words. This isn’t the only way, but it is relatively easy to do and fits in nicely with the use of tidy text data.
Suppose we want to analyze the sentiment of entire tweets. We’ll measure the positivity of a tweet by the difference in the number of positive and negative words used in the tweet.
sentiments_per_tweet <- tidy_tweets %>%
group_by(id_str) %>%
summarize(net_sentiment = (sum(sentiment == 'positive') - sum(sentiment == 'negative')),
month = first(month))
To see if the measure makes sense, let’s have a look at the most negative tweets.
tweets %>%
left_join(sentiments_per_tweet) %>%
arrange(net_sentiment) %>%
head(5) %>%
select(text, net_sentiment)
## # A tibble: 5 × 2
## text net_s…¹
## <chr> <int>
## 1 Where’s the Collusion? They made up a phony crime called Collusion, a… -11
## 2 WITCH HUNT! There was no Russian Collusion. Oh, I see, there was no R… -9
## 3 This is an illegally brought Rigged Witch Hunt run by people who are … -9
## 4 How come every time I show anger, disgust or impatience, enemies say … -8
## 5 So, the Democrats make up a phony crime, Collusion with the Russians,… -8
## # … with abbreviated variable name ¹net_sentiment
And the most positive tweets:
tweets %>%
left_join(sentiments_per_tweet) %>%
arrange(desc(net_sentiment)) %>%
head(5) %>%
select(text, net_sentiment)
## # A tibble: 5 × 2
## text net_s…¹
## <chr> <int>
## 1 "Congratulations to Patrick Reed on his great and courageous MASTERS … 10
## 2 "Thank you, @WVGovernor Jim Justice, for that warm introduction. Toni… 7
## 3 "Today, as we celebrate Hispanic Heritage Month, we share our gratitu… 7
## 4 "It is my great honor to be with so many brilliant, courageous, patri… 7
## 5 "\"Success is not the key to happiness. Happiness is the key to succe… 6
## # … with abbreviated variable name ¹net_sentiment
We can also look at trends over time. The plot below shows the proportion of monthly tweets that were negative (i.e. where the number of negative words exceeded the number of positive ones).
sentiments_per_tweet %>%
group_by(month) %>%
summarize(prop_neg = sum(net_sentiment < 0) / n()) %>%
ggplot(aes(x = month, y = prop_neg)) +
geom_line() + geom_smooth()
Interestingly, we see very few negative-sentiment tweets around 2010, in contrast to more recent years.
One problem we haven’t considered yet is what to do with terms like “not good”, where a positive word is negated by the use of “not” before it. We need to reverse the sentiment of words that are preceded by negation words like not, never, etc.
We’ll do this in the context of a sentiment analysis on bi-grams. We start by creating the bi-grams, and separating the two words making up each bi-gram. This is the same code used in the previous section.
bigrams_separated <- tweets %>%
filter(is_retweet == FALSE) %>%
mutate(text = str_replace_all(text, replace_reg, '')) %>%
unnest_tokens(bigram, text, token = 'ngrams', n = 2) %>%
separate(bigram, c('word1', 'word2'), sep = ' ')
Then we use the bing sentiment dictionary to look up the sentiment of each word in each bi-gram.
bigrams_separated <- bigrams_separated %>%
# add sentiment for word 1
left_join(bing, by = c(word1 = 'word')) %>%
rename(sentiment1 = sentiment) %>%
mutate(sentiment1 = ifelse(word1 == 'trump', NA, sentiment1)) %>%
mutate(sentiment1 = ifelse(is.na(sentiment1), 'neutral', sentiment1)) %>%
# add sentiment for word 2
left_join(bing, by = c(word2 = 'word')) %>%
rename(sentiment2 = sentiment) %>%
mutate(sentiment2 = ifelse(word2 == 'trump', NA, sentiment2)) %>%
mutate(sentiment2 = ifelse(is.na(sentiment2), 'neutral', sentiment2)) %>%
select(month, word1, word2, sentiment1, sentiment2, everything())
bigrams_separated %>% head(5)
## # A tibble: 5 × 14
## month word1 word2 senti…¹ senti…² is_re…³ favor…⁴ in_re…⁵ retwe…⁶ creat…⁷
## <date> <chr> <chr> <chr> <chr> <lgl> <int> <chr> <int> <chr>
## 1 2009-05-01 be sure neutral neutral FALSE 202 <NA> 253 Mon Ma…
## 2 2009-05-01 sure to neutral neutral FALSE 202 <NA> 253 Mon Ma…
## 3 2009-05-01 to tune neutral neutral FALSE 202 <NA> 253 Mon Ma…
## 4 2009-05-01 tune in neutral neutral FALSE 202 <NA> 253 Mon Ma…
## 5 2009-05-01 in and neutral neutral FALSE 202 <NA> 253 Mon Ma…
## # … with 4 more variables: id_str <chr>, source <chr>, date <dttm>,
## # is_potus <lgl>, and abbreviated variable names ¹sentiment1, ²sentiment2,
## # ³is_retweet, ⁴favorite_count, ⁵in_reply_to_user_id_str, ⁶retweet_count,
## # ⁷created_at
Now we need a list of words that we consider to be negation words. We’ll use the following set, taken from TMR Chapter 4, and show a few examples.
negation_words <- c('not', 'no', 'never', 'without')
# show a few
filter(bigrams_separated, word1 %in% negation_words) %>%
head(10) %>% select(month, word1, word2, sentiment1, sentiment2) # for display purposes
## # A tibble: 10 × 5
## month word1 word2 sentiment1 sentiment2
## <date> <chr> <chr> <chr> <chr>
## 1 2009-05-01 never be neutral neutral
## 2 2009-05-01 not be neutral neutral
## 3 2009-05-01 not a neutral neutral
## 4 2010-07-01 never be neutral neutral
## 5 2011-01-01 no environmental neutral neutral
## 6 2011-07-01 no revenue neutral neutral
## 7 2011-07-01 not negotiate neutral neutral
## 8 2011-07-01 no deal neutral neutral
## 9 2011-07-01 not answered neutral neutral
## 10 2011-07-01 not like neutral positive
We now reverse the sentiment of word2 whenever it is
preceded by a negation word, and then add up the number of positive and
negative words within a bi-gram and take the difference. That difference
(a score from -2 to +2) is the sentiment of the bi-gram.
We do this in two steps for illustrative purposes. First we reverse the sentiment of the second word in the bi-gram if the first one is a negation word.
bigrams_separated <- bigrams_separated %>%
# create a variable that is the opposite of sentiment2
mutate(opp_sentiment2 = recode(sentiment2, 'positive' = 'negative',
'negative' = 'positive',
'neutral' = 'neutral')) %>%
# reverse sentiment2 if word1 is a negation word
mutate(sentiment2 = ifelse(word1 %in% negation_words, opp_sentiment2, sentiment2)) %>%
# remove the opposite sentiment variable, which we don't need any more
select(-opp_sentiment2)
Next, we calculate the sentiment of each bi-gram and join up the words in the bi-gram again.
bigrams_separated <- bigrams_separated %>%
mutate(net_sentiment = (sentiment1 == 'positive') + (sentiment2 == 'positive') -
(sentiment1 == 'negative') - (sentiment2 == 'negative')) %>%
unite(bigram, word1, word2, sep = ' ', remove = FALSE)
bigrams_separated %>% select(word1, word2, sentiment1, sentiment2, net_sentiment)
## # A tibble: 592,367 × 5
## word1 word2 sentiment1 sentiment2 net_sentiment
## <chr> <chr> <chr> <chr> <int>
## 1 be sure neutral neutral 0
## 2 sure to neutral neutral 0
## 3 to tune neutral neutral 0
## 4 tune in neutral neutral 0
## 5 in and neutral neutral 0
## 6 and watch neutral neutral 0
## 7 watch donald neutral neutral 0
## 8 donald trump neutral neutral 0
## 9 trump on neutral neutral 0
## 10 on late neutral neutral 0
## # … with 592,357 more rows
Below we show Trump’s most common positive and negative bigrams.
bigrams_separated %>%
filter(net_sentiment > 0) %>% # get positive bigrams
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab('')
bigrams_separated %>%
filter(net_sentiment < 0) %>% # get negative bigrams
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab('')
None of the most common negative bi-grams have negated words in them but some that are slightly less frequently used do. Notice that the joint most frequently used bi-gram below is “no wonder” - which is not really negative, although you can see how, using the approach we have taken, it has ended up classified as such. Cases like these would need to be handled on an individual basis.
bigrams_separated %>%
filter(net_sentiment < 0) %>% # get negative bigrams
filter(word1 %in% negation_words) %>% # get bigrams where first word is negation
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab('')
We now look at sentiment in Shakespeare’s Romeo and Juliet. Let’s begin by loading the data, which was originally obtained online via the gutenbergr package.
library(gutenbergr)
load("gutenberg_shakespeare.RData")
rnj <- works$`Romeo and Juliet`
We’ve got the text now, but there is still work to be done. We first slice off the initial parts we don’t want (title, author, etc.), then get rid of other tidbits that would interfere, using a little regex to aid the process.
rnj_filtered = rnj %>%
slice(-(1:49)) %>%
filter(!text==str_to_upper(text), # will remove THE PROLOGUE etc.
!text==str_to_title(text), # will remove names/single word lines
!str_detect(text, pattern='^(Scene|SCENE)|^(Act|ACT)|^\\[')) %>%
select(-gutenberg_id) %>%
unnest_tokens(sentence, input=text, token='sentences') %>%
mutate(sentenceID = 1:n())
rnj_filtered
## # A tibble: 3,318 × 2
## sentence sentenceID
## <chr> <int>
## 1 two households, both alike in dignity, 1
## 2 in fair verona, where we lay our scene, 2
## 3 from ancient grudge break to new mutiny, 3
## 4 where civil blood makes civil hands unclean. 4
## 5 from forth the fatal loins of these two foes 5
## 6 a pair of star-cross'd lovers take their life; 6
## 7 whose misadventur'd piteous overthrows 7
## 8 doth with their death bury their parents' strife. 8
## 9 the fearful passage of their death-mark'd love, 9
## 10 and the continuance of their parents' rage, 10
## # … with 3,308 more rows
The following unnests the data into word tokens. At this point we can also remove stop words like a, an, the, etc.; tidytext comes with a stop_words data frame for this purpose. Note first, though, that some stop words also appear in the sentiment lexicons, so removing them can affect the results; we show some of the matches below.
# show some of the matches
stop_words$word[which(stop_words$word %in% sentiments$word)] %>% head(20)
## [1] "appreciate" "appropriate" "available" "awfully"
## [5] "best" "better" "clearly" "enough"
## [9] "like" "liked" "reasonably" "right"
## [13] "sensible" "sorry" "thank" "unfortunately"
## [17] "unlikely" "useful" "welcome" "well"
# remember to call output 'word' or antijoin won't work without a 'by' argument
rnj_filtered = rnj_filtered %>%
unnest_tokens(output=word, input=sentence, token='words') %>%
anti_join(stop_words)
Now we add the sentiments via the inner_join function. Here we use the default sentiments data frame, which is the ‘bing’ lexicon; you could use another lexicon and might get a different result.
rnj_filtered %>%
count(word) %>%
arrange(desc(n))
## # A tibble: 3,288 × 2
## word n
## <chr> <int>
## 1 thou 276
## 2 thy 165
## 3 love 140
## 4 thee 139
## 5 romeo 110
## 6 night 83
## 7 death 71
## 8 hath 64
## 9 sir 58
## 10 art 55
## # … with 3,278 more rows
rnj_sentiment = rnj_filtered %>%
inner_join(sentiments)
rnj_sentiment
## # A tibble: 2,077 × 3
## sentenceID word sentiment
## <int> <chr> <chr>
## 1 1 dignity positive
## 2 2 fair positive
## 3 3 grudge negative
## 4 3 break negative
## 5 4 unclean negative
## 6 5 fatal negative
## 7 7 overthrows negative
## 8 8 death negative
## 9 8 strife negative
## 10 9 fearful negative
## # … with 2,067 more rows
rnj_sentiment_bing = rnj_sentiment
table(rnj_sentiment_bing$sentiment)
##
## negative positive
## 1244 833
Looks like this one is going to be a downer. The following visualizes the cumulative positive and negative sentiment scores as one progresses sentence by sentence through the work, using the plotly package. The same information is also shown expressed as a difference (the opaque line).
library(plotly)
ay <- list(
tickfont = list(color = "#2ca02c40"),
overlaying = "y",
side = "right",
# title = "sentiment difference",
titlefont = list(textangle=45),
zeroline = F
)
rnj_sentiment_bing %>%
arrange(sentenceID) %>%
mutate(positivity = cumsum(sentiment=='positive'),
negativity = cumsum(sentiment=='negative')) %>%
plot_ly() %>%
add_lines(x=~sentenceID, y=~positivity, name='positive') %>%
add_lines(x=~sentenceID, y=~negativity, name='negative') %>%
add_lines(x=~sentenceID, y=~positivity-negativity, name='difference',
yaxis = "y2",
opacity=.25) %>%
layout(
xaxis = list(dtick = 200),
yaxis = list(title='absolute cumulative sentiment'),
yaxis2 = ay
)
It’s a close game until perhaps the midway point, when negativity takes over and despair sets in with the story.
In general, sentiment analysis can be a useful way of exploring data, but it is highly dependent on the context and tools used. Note also that ‘sentiment’ can be anything; it doesn’t have to be positive vs. negative. Any vocabulary may be applied, and so the approach has more utility than the usual implementation.
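For instance, a made-up “theme” vocabulary can be used exactly like a sentiment lexicon; this base-R sketch uses hypothetical words and themes:

```r
# Toy "theme" lexicon: any labelled vocabulary works, not just positive/negative
theme_lexicon <- c(love = "romance", death = "tragedy",
                   night = "tragedy", fair = "romance")
tokens <- c("love", "night", "death", "sword")  # "sword" is not in the lexicon
themes <- theme_lexicon[tokens[tokens %in% names(theme_lexicon)]]
table(themes)  # counts tokens per theme: romance 1, tragedy 2
```

The tidy-data equivalent is an inner_join against a two-column word/theme tibble, exactly as done with bing above.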
It should also be noted that the above demonstration is largely conceptual and descriptive. While fun, it’s a bit simplified. For starters, classifying words as simply positive or negative is itself not a straightforward endeavor. As we noted at the beginning, context matters, and in general you’d want to take it into account. Modern methods of sentiment analysis would use approaches like word2vec or deep learning to predict a sentiment probability, as opposed to a simple word match. Even in the above, matching sentiments to texts would probably only be a precursor to building a model predicting sentiment, which could then be applied to new data.
Finally, it should be noted that there is a great, detailed summary of the techniques applied and introduced here in the Text Mining in R book, particularly at the link shown.
In this final example we apply word frequencies, comparisons between texts, sentiment analysis and word clouds. The data set used is a series of reviews for the OnePlus phone models, found here.
The data can be downloaded directly from the URL used below.
# download data
url <- "https://raw.github.com/VladAluas/Text_Analysis/master/Datasets/Text_review.csv"
# read data
reviews <- read_csv(url)
reviews
## # A tibble: 433 × 3
## Model Segment Text
## <chr> <chr> <chr>
## 1 OnePlus 1 Introduction "The days of the $600 smartphon…
## 2 OnePlus 1 Design, Features, and Call Quality "The OnePlus One doesn't feel l…
## 3 OnePlus 1 Design, Features, and Call Quality "Our white test unit features a…
## 4 OnePlus 1 Design, Features, and Call Quality "The 5.5-inch, 1080p IPS displa…
## 5 OnePlus 1 Design, Features, and Call Quality "There are two speaker grilles …
## 6 OnePlus 1 Design, Features, and Call Quality "With GSM (850/900/1800/1900MHz…
## 7 OnePlus 1 Design, Features, and Call Quality "Call quality, unfortunately, w…
## 8 OnePlus 1 Design, Features, and Call Quality "I also noticed a bug when it c…
## 9 OnePlus 1 Design, Features, and Call Quality "Also onboard are dual-band 802…
## 10 OnePlus 1 Performance and CyanogenMod "The OnePlus One is powered by …
## # … with 423 more rows
Let’s have a look at the data.
# view
head(reviews)
## # A tibble: 6 × 3
## Model Segment Text
## <chr> <chr> <chr>
## 1 OnePlus 1 Introduction "The days of the $600 smartphone…
## 2 OnePlus 1 Design, Features, and Call Quality "The OnePlus One doesn't feel li…
## 3 OnePlus 1 Design, Features, and Call Quality "Our white test unit features a …
## 4 OnePlus 1 Design, Features, and Call Quality "The 5.5-inch, 1080p IPS display…
## 5 OnePlus 1 Design, Features, and Call Quality "There are two speaker grilles f…
## 6 OnePlus 1 Design, Features, and Call Quality "With GSM (850/900/1800/1900MHz)…
The data is structured in three columns: the model number, the segment of the review and the text from the segment.
We have chosen to keep each paragraph from each review as a separate text because it’s easier to work with, and it’s more realistic. This is most likely how you might analyse the data when you read and compare the reviews section by section.
As things stand we know that we can’t count words or quantify them in any way, so we will need to transform the last column into a more analysis-friendly format. Let’s look at tokenizing our data.
# activate some libraries
library(tidytext)
library(tidyverse)
reviews %>%
# We need to specify the name of the column to be created (Word) and the source column (Text)
unnest_tokens("Word", "Text")
## # A tibble: 30,067 × 3
## Model Segment Word
## <chr> <chr> <chr>
## 1 OnePlus 1 Introduction the
## 2 OnePlus 1 Introduction days
## 3 OnePlus 1 Introduction of
## 4 OnePlus 1 Introduction the
## 5 OnePlus 1 Introduction 600
## 6 OnePlus 1 Introduction smartphone
## 7 OnePlus 1 Introduction aren't
## 8 OnePlus 1 Introduction over
## 9 OnePlus 1 Introduction quite
## 10 OnePlus 1 Introduction yet
## # … with 30,057 more rows
The function took all the sentences from the Text column and broke them down into a format that has one word per row, and far more rows than before. Our new data structure is now one step away from a tidy format: all we need to do is count how many times each word appears in the text, and then we will have a tidy format.
As mentioned in the first part of the tutorial, the function has transformed all the words to lower case and removed all the special symbols (e.g. the $ from the price described in the introduction of the OnePlus 1). This is important because it can save us a lot of headaches when cleaning the data.
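To make that cleaning step concrete, here is a rough base-R approximation of what unnest_tokens() does to a single string: lowercasing, stripping symbols such as $, and splitting on whitespace. This is only an illustration; the real tokenizer (via the tokenizers package) handles many more cases.

```r
# Rough base-R sketch of unnest_tokens() applied to one string:
# lowercase, strip symbols (keeping apostrophes), split on whitespace.
text <- "The days of the $600 smartphone aren't over quite yet"
tokens <- tolower(text)
tokens <- gsub("[^a-z0-9' ]", "", tokens)   # drops the $, keeps "600"
tokens <- strsplit(tokens, "\\s+")[[1]]
tokens
# "the" "days" "of" "the" "600" "smartphone" "aren't" "over" "quite" "yet"
```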
Now we will transform the data into a proper tidy format. To do so,
we will unnest the sentences, count each word, and then display the
frequencies on a graph. Because we want to reuse the graph later, we
will create a function, word_frequency(), that contains all
the steps we want to apply. We will also replace some characters so
that we do not double- or under-count some words. In the function we
also turn the word column into a factor, with the levels ordered so
the most frequent words appear on top, purely for aesthetic purposes.
# tokenize
reviews_tidy <- reviews %>%
unnest_tokens("Word", "Text") %>%
mutate(Word = str_replace(Word, "'s", "")) # prevent the analysis in showing 6t and 6t's as two separate words
# create a function that will store all the operations we will repeat several times
word_frequency <- function(x, top = 10){
x %>%
count(Word, sort = TRUE) %>% #need a word count
mutate(Word = factor(Word, levels = rev(unique(Word)))) %>%
top_n(top) %>%
ungroup() %>% # useful later if we want to use a grouping variable and will do nothing if we don't
# The graph itself
ggplot(mapping = aes(x = Word, y = n)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(x = NULL)
}
Let us try out this function now.
reviews_tidy %>%
word_frequency(15)
There, frequency analysis done. Now, what does the word the say about
the OnePlus brand of phones? Nothing. Determiners and conjunctions
(e.g. the, and, a, to) are the most frequently used words in any
language, but they tell us little about the message of a sentence, at
least not by themselves. These are called stop words, and we will
eliminate them so we can focus on the words that give us a better
picture of the text.
Once again we shall use the data set called stop_words, which
contains a list of common determiners, conjunctions, adverbs and
adjectives that we can eliminate from a text so that we can analyse it
properly. Now we can recreate the previous graph after eliminating the
stop words, and see what it tells us about the OnePlus phones
overall.
# Same dataset as before with an extra code line
reviews_tidy <- reviews %>%
unnest_tokens("Word", "Text") %>%
anti_join(stop_words, by = c("Word" = "word")) %>% # anti_join keeps only the rows of reviews with no match in stop_words
mutate(Word = str_replace(Word, "'s", ""))
reviews_tidy %>%
word_frequency(15)
#> Selecting by n
As an overall idea, we can see that the brand name (OnePlus) is the most used, as we would expect. Then, we can see phone, which is to be expected since we are talking about a product that is a phone.
We can also see that galaxy is mentioned quite a lot, just as much as camera which is again expected. OnePlus promoted themselves as a brand with high performance models at a cheaper price than a flagship from Samsung or other makers, therefore it would be only natural to see the comparison between the two.
Another pairing we see is low and light, which is the part of the reviews comparing camera performance in low light. You might also have spotted that 7 and 8 are there as well. This is most likely because the 7 from the OnePlus 7 series is mentioned quite a lot; the same goes for the 8.
Now, the graph we have been looking at shows us the most frequently used words across all texts in the corpus. This is useful because it gives us some good insight into which words are most associated with OnePlus as a brand overall.
But, I would also like to have the top 5 words associated with each model. We can do so by adding two lines of code to the previous chunk. It’s as simple as below.
reviews_tidy %>%
group_by(Model) %>%
word_frequency(5) +
facet_wrap(~ Model, scales = "free_y") # This is just to split the graph into multiple graphs for each model
We have a matrix of graphics that shows us which terms are most frequently associated with a model and that is very useful from a business perspective.
Now we find ourselves in a good position. We have these graphs and need to analyse them critically - and we can automate this, to determine the conclusion for each model.
We will use a method called term frequency - inverse document
frequency. Note that we shall cover this in much greater detail in
another tutorial (Bag of Words and Topic Modelling Tutorial). For now,
we shall cover it briefly. We saw in the first graph that the most
frequent terms in the reviews are the ones with no analytical value
whatsoever: the, and, a, etc. Words that do have analytical value
(e.g. performance) appear less often. The tf_idf method
works on exactly this principle.
We can check for the words that are frequent in one review and not
the others, to see what distinguishes one document from another. This
comparison can be done with the function bind_tf_idf(),
which assigns weights to words using the principles below:
words with high frequency in all the documents: low weight
words with high frequency in just one of the documents and not the other: high weight
words with low frequency across the board: low weight
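To see where those weights come from, here is a hand computation on a tiny made-up two-document corpus, using the same definitions bind_tf_idf() uses: tf is a word's count divided by the document's length, and idf is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Toy two-document corpus (hypothetical, for illustration only)
docs <- list(
  d1 = c("phone", "camera", "phone", "overheating"),
  d2 = c("phone", "camera", "elegant", "phone")
)
n_docs <- length(docs)

# idf = ln(#documents / #documents containing the word)
idf <- function(word) {
  log(n_docs / sum(vapply(docs, function(d) word %in% d, logical(1))))
}
# tf = count of the word / number of words in the document
tf <- function(word, d) sum(docs[[d]] == word) / length(docs[[d]])

tf("phone", "d1") * idf("phone")              # 0: appears in every document
tf("overheating", "d1") * idf("overheating")  # ~0.173: unique to d1, high weight
```

A word present in every document gets idf = ln(1) = 0, so its tf-idf is zero no matter how often it appears; a word concentrated in one document gets a positive weight.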
Let’s see this in practice.
review_tf_idf <-
reviews_tidy %>%
count(Model, Word, sort = TRUE) %>%
bind_tf_idf(Word, Model, n)
review_tf_idf %>%
arrange(desc(tf_idf))
## # A tibble: 8,213 × 6
## Model Word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 OnePlus 6T 6t 31 0.0293 2.08 0.0609
## 2 OnePlus 5T 5t 21 0.0195 2.77 0.0540
## 3 OnePlus 6 s9 18 0.0183 2.77 0.0508
## 4 OnePlus 7T McLaren mclaren 20 0.0292 1.67 0.0489
## 5 OnePlus 2 s6 11 0.0164 2.77 0.0456
## 6 OnePlus 7 Pro 5G 5g 40 0.0620 0.693 0.0430
## 7 OnePlus 7T 7t 22 0.0306 1.39 0.0425
## 8 OnePlus 8 Pro s20 25 0.0245 1.67 0.0410
## 9 OnePlus 8 s20 25 0.0223 1.67 0.0373
## 10 OnePlus 3T 3t 21 0.0222 1.67 0.0372
## # … with 8,203 more rows
Now we can display this using plots. Note that we need to sort the data in descending order before creating the factors for each term, as we did previously. Then we proceed to plotting.
review_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(Word = factor(Word, levels = rev(unique(Word)))) %>%
group_by(Model) %>%
top_n(5) %>%
ungroup() %>%
ggplot(mapping = aes(x = Word, y = tf_idf, fill = Model)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = NULL) +
coord_flip() +
facet_wrap(~ Model, scales = "free_y")
These are the main items that separate one review from another. Among them we can see Samsung's flagships, especially in later reviews; the two brands seem to be compared quite a lot. We can also single out that for the OnePlus 7 Pro 5G there is a problem with overheating, and that the OnePlus 6 is described as elegant.
Of course, this can be tweaked quite a bit depending on your needs. You can eliminate words, you can replace some of them, or you can add a different grouping to the analysis.
Now let’s dive into some sentiment analysis. In this context, it can be used to quickly get an idea about the product, in this case a phone. We shall use a lexicon and simply associate each word in the review with a sentiment. Then it becomes a simple matter of counting how many words are associated with positive or negative sentiments to get the overall tone of the text.
Let’s proceed by using the AFINN lexicon to check the
sentiment for each model and see how they perform. We will use just the
conclusion for each review as that should be the most relevant in
transmitting the overall sentiment for the whole review.
However, we have to keep in mind that these being technical reviews, they might contain a terminology different from the one used in natural language, and the analysis might not be as accurate.
conclusion_afinn <- reviews %>%
filter(str_detect(Segment, "Conclusion")) %>%
unnest_tokens("Word", "Text") %>%
anti_join(stop_words, by = c("Word" = "word")) %>%
# We will get the sentiments with an inner_join, since words without a match have no score value
inner_join(get_sentiments("afinn"), by = c("Word" = "word"))
conclusion_afinn
## # A tibble: 122 × 4
## Model Segment Word value
## <chr> <chr> <chr> <dbl>
## 1 OnePlus 1 Cameras and Conclusions cut -1
## 2 OnePlus 1 Cameras and Conclusions true 2
## 3 OnePlus 1 Cameras and Conclusions alive 1
## 4 OnePlus 1 Cameras and Conclusions true 2
## 5 OnePlus 1 Cameras and Conclusions miss -2
## 6 OnePlus 1 Cameras and Conclusions straight 1
## 7 OnePlus 1 Cameras and Conclusions capable 1
## 8 OnePlus 1 Cameras and Conclusions free 1
## 9 OnePlus 1 Cameras and Conclusions demand -1
## 10 OnePlus 1 Cameras and Conclusions impress 3
## # … with 112 more rows
As you can see, each token has been unnested, and assigned a sentiment value. Now, in order to check the sentiments for each review, all we need to do is add the scores and plot them.
conclusion_afinn %>%
group_by(Model) %>%
summarise(Score = sum(value)) %>%
arrange(desc(Score)) %>%
mutate(Model = factor(Model, levels = rev(unique(Model)))) %>%
ggplot(mapping = aes(x = Model, y = Score)) +
geom_col() +
coord_flip() +
labs(x = NULL)
The scores are in, and overall the OnePlus 2 has the best reviews. However, what if we want to see a report on which model has the most positive and negative reviews? For that we will use the bing lexicon.
conclusion_bing <- reviews %>%
filter(str_detect(Segment, "Conclusion")) %>%
unnest_tokens("Word", "Text") %>%
anti_join(stop_words, by = c("Word" = "word")) %>%
inner_join(get_sentiments("bing"), by = c("Word" = "word"))
conclusion_bing
## # A tibble: 189 × 4
## Model Segment Word sentiment
## <chr> <chr> <chr> <chr>
## 1 OnePlus 1 Cameras and Conclusions led positive
## 2 OnePlus 1 Cameras and Conclusions distortion negative
## 3 OnePlus 1 Cameras and Conclusions miss negative
## 4 OnePlus 1 Cameras and Conclusions dynamic positive
## 5 OnePlus 1 Cameras and Conclusions distortion negative
## 6 OnePlus 1 Cameras and Conclusions warped negative
## 7 OnePlus 1 Cameras and Conclusions unnatural negative
## 8 OnePlus 1 Cameras and Conclusions admirable positive
## 9 OnePlus 1 Cameras and Conclusions soft positive
## 10 OnePlus 1 Cameras and Conclusions prefer positive
## # … with 179 more rows
Now we can proceed with the same steps, just add the sentiment to the grouping.
conclusion_bing %>%
group_by(Model, sentiment) %>%
count() %>%
ungroup() %>%
mutate(Model = reorder(Model, n)) %>%
ggplot(mapping = aes(x = Model, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(x = NULL, y = "Negative vs positive sentiment / Model") +
facet_wrap(~ sentiment, ncol = 2)
For example, the OnePlus 6T and the OnePlus 7 (for China) have no negative reviews, but they also have only a few positive things said about them. This seems to be reflected in their placement in the previous graph as well.
Both these approaches have their advantages and disadvantages and in practice you will most likely use a combination of both, not just one. It is really useful to view a problem from multiple angles.
However, please note that the lexicons we have used here are applied to just one word at a time, and that can miss the sentiment of a phrase (e.g. not good is a negative phrase, but the lexicon sees not as neutral and good as positive, so overall it scores the phrase as positive). To avoid situations like this we can use pairings of words and check for these cases.
As such, negation is not handled in this example.
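A minimal sketch of that pairing idea, using a tiny made-up lexicon (not AFINN or bing): flip a word's score whenever the word immediately before it is a negator.

```r
# Hypothetical mini-lexicon and negator list, for illustration only
lexicon  <- c(good = 3, bad = -3, great = 4)
negators <- c("not", "no", "never")

score_with_negation <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  total <- 0
  for (i in seq_along(words)) {
    s <- lexicon[words[i]]
    if (!is.na(s)) {
      # flip the score if the previous word is a negator
      if (i > 1 && words[i - 1] %in% negators) s <- -s
      total <- total + s
    }
  }
  unname(total)
}

score_with_negation("the camera is not good")  # -3, instead of +3 for "good" alone
```

A word-at-a-time lexicon would score this sentence +3; the bigram check correctly turns it negative. Real implementations (e.g. tidytext's bigram approach in Text Mining with R) work on the same principle.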
With that in mind, we look to employ our final method - wordclouds.
Wordclouds are a different approach to presenting the data. I personally find them very useful when you are trying to communicate the prevalence of a word in a text or speech. They play much the same role as a pie chart, but display the data in a more user-friendly way: a wordcloud lets you see how frequent a word is at a glance, without having to check and re-check a legend dozens of times.
With that said, let’s check our wordcloud. It should show the same
data as the first graph, just in a different display style, so I will
use the same data set reviews_tidy. For this we will use
the wordcloud package.
library(wordcloud)
reviews_tidy %>%
count(Word) %>%
with(wordcloud(Word, n, max.words = 100))
As you can see, the results are similar to the first analysis: the more frequent a word, the larger the font. However, with this type of graph we can include many more items; here we have included 100 words, as opposed to 15 in the first graph.
Let’s see how we can use a wordcloud for sentiment analysis.
library(reshape2)
reviews_tidy %>%
inner_join(get_sentiments("bing"), by = c("Word" = "word")) %>%
count(Word, sentiment, sort = TRUE) %>%
acast(Word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#202121", "#797C80"),
max.words = 50)
This is a very quick and useful way to show which elements influence the sentiment for your product the most and make decisions based on it.
We can clearly see that the words that most influence the negative scores are noise, expensive and loud, while the ones that most influence the positive reviews are excellent, fast and smooth.