Overview and Introduction

The internet is rife with data sets you can use for your own personal projects. Often you won't be able to request the data directly or obtain it in a neat format. That is, while data has many sources, its biggest repository is the web. When this happens, we turn to web scraping, a technique where we get the data we want to analyze by finding it in a website's HTML code.

Web scraping is the process of extracting data from websites; more precisely, the process of automatically extracting content and data from a website.

This is achieved by extracting the underlying HTML code and, with it, the data stored in a database. It can be done manually, but when we talk of web scraping we typically mean gathering data from websites by automated means.

Web scraping involves two distinct processes: fetching or downloading the web page, and extracting data from it. In this tutorial we introduce the rvest package, which provides various web scraping functions. We also introduce the SelectorGadget tool and show how to use it to identify the parts of a webpage we want. Finally, we apply these tools in two larger scraping examples: property data and movie reviews.

References

We use examples throughout the tutorial based on the work of Ian Durbach; his profile is found here. This tutorial is based on DataQuest's article on web scraping in R, as well as material from the R for Data Science book, particularly Chapter 14 on Strings. Much of the work follows the Data Science for Industry course from the University of Cape Town. Other great resources used for this tutorial are listed below:

Background

Web scraping involves working with HTML files, the language used to construct web pages. We introduce bits and pieces of HTML as needed, but do not cover them from first principles or in great detail. There is a nice basic introduction to HTML here. You don't need to be an HTML wizard to scrape data effectively.

Rough Guide to HTML

Before we can start learning how to scrape a web page, we need to understand how a web page itself is structured. The main languages used to build web pages are Hypertext Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript. HTML gives a web page its actual structure and content. CSS gives a web page its style and look, including details like fonts and colors. JavaScript gives a web page its functionality. It is useful to have a rough idea of how everything fits together in a website's underlying operation, which is summarised below:

  • Websites are written using HTML (Hypertext Markup Language), a markup language. A web page is basically an HTML file: a plain-text file in which the text is written using the HTML language, i.e. it contains HTML tags, content, etc. HTML files can be linked to one another, which is how a website is put together.

  • An HTML file, and hence a web page, consists of two main parts: HTML tags and content. HTML tags are the parts of a web page that define how content is formatted and displayed in a web browser. It's easiest to explain with a small example. Below is a minimal HTML file: the tags are the commands within angle brackets, e.g. <head>; the word "html" surrounded by <> brackets indicates a tag. Try copying the text below into a text editor, saving it as a .html file, and opening it in your browser. Tags can be customised with tag attributes; for more information on HTML tags and other elements, look here. Notice also that tags come in pairs: the opening <html> tag is matched by a closing </html> tag, and together they mark the beginning and end of the HTML document.

<html>
<head>
<title>A simple webpage</title>
</head>
<body>

Some content. More <b>very important</b> content.

</body>
</html>
  • CSS (Cascading Style Sheets) is a 'style sheet language': a language that controls how documents written in a markup language like HTML are presented. Style sheets define things like the colour and layout of text and other HTML elements. 'Styling' covers a wide range of things, from the colour of particular HTML elements to their positioning. Like HTML, the scope of CSS is too large to cover every concept here; if you're interested, you can learn more here. Separating presentation from content is often useful, e.g. multiple HTML pages can share formatting through a shared CSS (.css) file.

  • A CSS file is written as a set of rules. Each rule consists of a selector and a declaration. The CSS selector points to the HTML element the declaration refers to. The declaration contains instructions about how the HTML element identified by the CSS selector should be presented. CSS selectors identify HTML elements by matching tags and tag attributes. There’s a fun tutorial on CSS selectors here.
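To make this concrete, here is a small illustrative rule (the selector and property values are invented for the example): the selector h1.headline matches every <h1> element with class "headline", and the declaration block describes how matched elements should be presented.

```css
/* selector: every <h1> element with class "headline" */
h1.headline {
  color: darkred;      /* declaration: text colour */
  font-family: serif;  /* declaration: font */
}
```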

Applications and Use-Cases

If there’s data on a website, then in theory, it’s scrapable! Common data types organizations collect include images, videos, text, product information, customer sentiments and reviews (on sites like Twitter, Yell, or TripAdvisor), and pricing from comparison websites.

Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include:

  • Price comparison sites deploying bots to auto-fetch prices and product descriptions from affiliated seller websites.

  • Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).

  • Amazon or eBay scraping data from product sites to support competitor analysis.

  • Google regularly using web scraping to analyze, rank, and index web content.

Web scraping invariably involves copying data, so copyright issues often arise, and there are legal limits on what types of information you can scrape. Web scraping also has a dark underbelly: bad actors scrape data such as bank details or other personal information to conduct fraud, scams, intellectual property theft, and extortion, and scrapers are used to undercut prices and steal copyrighted content.

Automated web scraping software can process data much more quickly than manual web users, placing a strain on host web servers. Scraping may also be against the terms of service of some websites. The bottom line is that the ethics of web scraping is not straightforward, and is evolving. There is lots of useful information on the web about these issues, for example here, here, and here.

The rvest Package

The rvest package, maintained by Hadley Wickham, lets users easily scrape ("harvest") data from web pages.

rvest is one of the tidyverse packages, so it works well with the other libraries in the bundle. It takes inspiration from the Python web scraping library Beautiful Soup; more details and a tutorial can be found here.

The package has several key functions used to scrape data from the web:

  • read_html(): Reads the HTML code of a web page from a URL (or a file or string).

  • html_node() / html_nodes(): Extract the content and tags matching specified CSS selectors (html_node() returns the first match, html_nodes() all matches).

  • html_text(): Extracts text content from nodes

  • html_attr(): Extracts the value of a named attribute from nodes, e.g. the href of a hyperlink.

Also, if the page contains tabular data you can convert it directly to a data frame with html_table() (Extracts tables from nodes).
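To see how these functions fit together before touching a live site, here is a minimal sketch that parses a small HTML string held in memory (the page content is invented for illustration, so it runs without a network connection):

```r
library(rvest)

# A tiny in-memory page (contents invented for illustration)
page <- read_html('<html><body>
  <h1>First headline</h1>
  <h1>Second headline</h1>
  <a href="https://example.com/story">Read more</a>
  <table>
    <tr><th>year</th><th>value</th></tr>
    <tr><td>2021</td><td>10</td></tr>
  </table>
</body></html>')

page %>% html_nodes('h1') %>% html_text()       # "First headline" "Second headline"
page %>% html_nodes('a') %>% html_attr('href')  # "https://example.com/story"
page %>% html_nodes('table') %>% .[[1]] %>% html_table()  # a one-row data frame
```

The same calls work unchanged when the page comes from a URL instead of a string.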

There are a number of tools that allow us to inspect web pages and see “what is under the hood”. We shall be looking at SelectorGadget.

How to Scrape: Generally

Scraping involves more than simply executing code and hoping for the best. The exact method for carrying out these steps depends on the tools you're using; we cover one such approach below, but first we describe the (non-technical) basics.

  1. Find the URLs you want to scrape

Figure out which website(s) you want to scrape.

  2. Inspect the page

Before coding your web scraper, you need to identify what it has to scrape. Right-clicking anywhere on the frontend of a website gives you the option to ‘inspect element’ or ‘view page source.’ This reveals the site’s back-end code, which is what the scraper will read. We shall also show how to use SelectorGadget tool for this purpose.

  3. Identify the data you want to extract

Your aim is to identify the unique tags that enclose (or 'nest') the relevant content.

  4. Write the necessary code

Once you’ve found the appropriate nest tags, you’ll need to incorporate these into your scraping code. We shall use rvest package. We shall use the functions in the package to tell it where to look and what to extract. When you’re coding your web scraper, it’s important to be as specific as possible about what you want to collect.

  5. Execute the code

Once you’ve written the code, the next step is to execute it. The scraper requests site access, extracts the data, and parses it.

  6. Store the data

After extracting, parsing, and collecting the relevant data, you’ll need to store it. You can instruct your algorithm to do this by adding extra lines to your code.

We also make use of regular expressions to extract a neater and easier-to-read set of data.
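As a small preview (the strings here are invented), the two stringr operations we lean on later are str_subset, which keeps only the strings matching a pattern, and str_replace_all, which strips out unwanted characters:

```r
library(stringr)

urls <- c('https://example.com/for-sale/area/p2',
          'https://example.com/contact-us')

# keep only the URLs whose text contains 'for-sale'
str_subset(urls, 'for-sale')
# [1] "https://example.com/for-sale/area/p2"

# remove everything that is not alphanumeric or a space
str_replace_all('Price: R1,250,000!', '[^[:alnum:] ]', '')
# [1] "Price R1250000"
```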

How to Scrape: rvest

There are several steps involved in using rvest which are conceptually quite straightforward:

  1. Identify a URL to be examined for content

  2. Use SelectorGadget, XPath, or your browser's Inspect tool to identify the "selector". This could be a paragraph, a table, hyperlinks, or images

  3. Load rvest

  4. Use read_html to "read" the URL

  5. Pass the result to html_nodes to get the selectors identified in step 2

  6. Get the text or table content

  • rvest uses CSS selectors to identify the parts of the web page to scrape.

Setup

First we load the packages we’ll need in this workbook.

library(rvest)
library(tidyverse)
library(stringr)

Example 1: Using the Selector Gadget

This example was directly taken from Ian Durbach’s work. We’ll use the SelectorGadget tool to find the CSS selectors for headlines on the Daily Maverick website. Then we’ll use the rvest package to scrape the headings and save them as strings in R.

First, make sure you’ve got the SelectorGadget tool available in your web browser’s toolbar. Go to http://selectorgadget.com/ and find the link that says ‘drag this link to your bookmark bar’. You only need to do this once.

Now let’s visit the Daily Maverick webpage. Click on the SelectorGadget tool and identify the CSS selectors for headlines. It should just be h1, although this may change with time, and will likely be different for different news sites. Another way of identifying specific elements on a web page is to open the element inspector.

Finally, let's switch over to R and scrape the headlines. We first read in the webpage using read_html. This simply reads in an HTML document, which can come from a URL, a file on disk or a string. It returns an XML document (XML being another markup language).

dm_page <- read_html('https://www.dailymaverick.co.za/')
dm_page
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="home page-template-default page page-id-1057736 wp-custom-lo ...

We extract relevant information from the document with html_nodes. This returns a set of XML element nodes, each one containing the tag and contents (e.g. text) associated with the specified CSS selectors:

dm_elements <- html_nodes(x = dm_page, css = 'h1')
dm_elements[1:5]
## {xml_nodeset (5)}
## [1] <h1><a target="_self" href="/" title="Daily Maverick"><img class="img-res ...
## [2] <h1>Spree of arrests of top cops exposes the depth of the rot alleged in  ...
## [3] <h1>What happens to the money lost to corruption after the state claws it ...
## [4] <h1>Spanish federation backs women’s soccer manager as players quit natio ...
## [5] <h1>SA signs agreements with three independent power producers, but is it ...

To get just the text inside the element nodes we use html_text, with trim = TRUE to clean up white space characters.

dm_text <- html_text(dm_elements, trim = TRUE) 
as_tibble(dm_text) %>% head(5)
## # A tibble: 5 × 1
##   value                                                                         
##   <chr>                                                                         
## 1 ""                                                                            
## 2 "Spree of arrests of top cops exposes the depth of the rot alleged in SAPS’s …
## 3 "What happens to the money lost to corruption after the state claws it back?" 
## 4 "Spanish federation backs women’s soccer manager as players quit national tea…
## 5 "SA signs agreements with three independent power producers, but is it too li…

If the resulting table contains some content we don't want, we can clean up the text later. For now this suffices as a simple example of using the SelectorGadget tool and the basic functions in rvest.


Example 2: Scraping Tables

One especially useful form of scraping is getting tables containing data from websites. This example shows you how to do that.

We’ll use the now well-known table on the worldometers webpage, containing the latest available coronavirus data. Before running the code below, visit the webpage and use SelectorGadget to identify the CSS selector you need. For this illustration, we will select all elements corresponding to table.

First, read the webpage as before:

covid_page <- read_html('https://www.worldometers.info/coronavirus/')

Extract the table element(s) with html_nodes().

covid_elements <- html_nodes(covid_page, 'table')

View the extracted elements. Say we want yesterday’s table, to extract the daily increases in infections and deaths.

covid_elements
## {xml_nodeset (3)}
## [1] <table id="main_table_countries_today" class="table table-bordered table- ...
## [2] <table id="main_table_countries_yesterday" class="table table-bordered ta ...
## [3] <table id="main_table_countries_yesterday2" class="table table-bordered t ...

Use html_table() to extract the tables inside the second element of covid_elements. Remember we select the second element, since we want data from yesterday’s table.

covid_table <- html_table(covid_elements[[2]])
head(covid_table[,1:6], 3)
## # A tibble: 3 × 6
##     `#` `Country,Other` TotalCases  NewCases TotalDeaths NewDeaths
##   <int> <chr>           <chr>       <chr>    <chr>           <int>
## 1    NA Asia            188,953,762 +155,476 1,478,239         319
## 2    NA North America   116,104,166 +16,250  1,536,722          87
## 3    NA Europe          225,899,773 +135,605 1,917,574         213

Wait, is this only per continent?

covid_table[1:20,1:6]
## # A tibble: 20 × 6
##      `#` `Country,Other` TotalCases  NewCases   TotalDeaths NewDeaths
##    <int> <chr>           <chr>       <chr>      <chr>           <int>
##  1    NA "Asia"          188,953,762 "+155,476" 1,478,239         319
##  2    NA "North America" 116,104,166 "+16,250"  1,536,722          87
##  3    NA "Europe"        225,899,773 "+135,605" 1,917,574         213
##  4    NA "South America" 64,030,870  "+12,053"  1,329,045          80
##  5    NA "Oceania"       12,346,720  "+1,139"   20,621             NA
##  6    NA "Africa"        12,641,789  "+178"     257,592            NA
##  7    NA ""              721         ""         15                 NA
##  8    NA "World"         619,977,801 "+320,701" 6,539,808         699
##  9     1 "China"         249,172     "+188"     5,226              NA
## 10     2 "USA"           97,895,860  "+15,377"  1,081,708          75
## 11     3 "India"         44,568,114  "+4,777"   528,510            23
## 12     4 "France"        35,125,681  "+38,024"  154,887            NA
## 13     5 "Brazil"        34,673,221  "+6,834"   685,837            21
## 14     6 "Germany"       32,952,050  ""         149,458            NA
## 15     7 "S. Korea"      24,594,336  "+29,315"  28,140             63
## 16     8 "UK"            23,621,952  ""         189,919            NA
## 17     9 "Italy"         22,284,812  "+22,360"  176,867            43
## 18    10 "Japan"         20,982,896  "+64,053"  44,262             85
## 19    11 "Russia"        20,746,163  "+51,269"  386,662           111
## 20    12 "Turkey"        16,873,793  ""         101,139            NA

No, although note that China is (at time of writing) listed as the first country, with the rest ordered according to total cases. It is always a good idea to, if possible, check your results against the website. Also bear in mind that webpages are dynamic and can yield different results over time.

We can also use the pipe operator to string all these commands together. Note the use of .[[i]], which is the operation ‘extract the i-th element’.

covid_table_piped <- read_html('https://www.worldometers.info/coronavirus/') %>% html_nodes('table') %>% .[[2]] %>%  html_table() 
head(covid_table_piped[,1:6], 4)
## # A tibble: 4 × 6
##     `#` `Country,Other` TotalCases  NewCases TotalDeaths NewDeaths
##   <int> <chr>           <chr>       <chr>    <chr>           <int>
## 1    NA Asia            188,953,762 +155,476 1,478,239         319
## 2    NA North America   116,104,166 +16,250  1,536,722          87
## 3    NA Europe          225,899,773 +135,605 1,917,574         213
## 4    NA South America   64,030,870  +12,053  1,329,045          80

We shall now move onto more complex examples.


Example 3: Scraping House Property Data

This is a more advanced example where we scrape data on houses for sale in a particular area of interest off the Property24 website.

The landing page for a suburb shows summaries for the first 20 houses. At the bottom of the page are links to further pages, each containing 20 house summaries. First we read in the landing page and identify all hyperlinks on that page. Links are all identified by CSS selector a. We want to extract the hypertext reference (href).

# if the link below doesn't work, it may be out-of-date... just find another
suburb <- read_html('https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169')
suburb_links <- suburb %>% html_nodes('a') %>% html_attr('href') 
print(suburb_links[1:6])
## [1] "https://www.microsoft.com/en-us/edge"                                
## [2] "/"                                                                   
## [3] "/"                                                                   
## [4] "/for-sale/waterfront/cape-town/western-cape/9169"                    
## [5] "/commercial-property-for-sale/waterfront/cape-town/western-cape/9169"
## [6] "/for-sale/waterfront/cape-town/western-cape/9169?sp=r%3dTrue"

Next, we need to identify just those hyperlinks that load pages with house summaries (let's call these 'summary pages'). We do this by pattern matching with regular expressions. A regular expression is a sequence of characters that specifies a search pattern in text. We shall dive into more detail on regular expressions at a later stage.

We use the str_subset function to look for URLs matching a specific pattern: ones that start with http and contain for-sale and 9169.

suburb_pages <- str_subset(suburb_links,'(http).*(for-sale).*(9169)')
suburb_pages
## [1] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169/p2"
## [2] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169"   
## [3] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169/p2"
## [4] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169/p3"
## [5] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169/p4"

We see duplicates, since the 'Next' button and the Page 2 link point to the same URL. We can remove duplicate links using the unique command. In this case, only 4 pages of houses for sale at Waterfront, Cape Town are provided on the website.

If there were more pages of interest than could be accessed through that initial page, we could manually fill in the gaps in our list of suburb pages, although a better approach would be to generalise and automate this next bit.

This is shown below. Say we wanted pages 7 to 14 as well.

suburb_pages <- sort(unique(c(suburb_pages, paste0(suburb_pages[2], '/p', 7:14))))
suburb_pages

At this stage we have URLs for all the Property24 pages for Waterfront in Cape Town. Next, for each of the summary pages, we extract the hyperlinks that lead to the full house ads.

We cycle through each of these pages in turn.

house_links <- c()
for(i in suburb_pages){
  suburb_i <- read_html(i)
  suburb_i_links <- suburb_i %>% html_nodes('a') %>% html_attr('href') 
  house_links_i <- str_subset(suburb_i_links,'(for-sale).*(9169/)[0-9]{9}$')
  house_links <- c(house_links, house_links_i)
}
# remove any duplicates and reorder
house_links <- sort(unique(house_links))

Now let's take a look at the links we got, and at how many properties they cover.

house_links[1:6]
## [1] "/for-sale/waterfront/cape-town/western-cape/9169/102066306"
## [2] "/for-sale/waterfront/cape-town/western-cape/9169/105419812"
## [3] "/for-sale/waterfront/cape-town/western-cape/9169/105827693"
## [4] "/for-sale/waterfront/cape-town/western-cape/9169/105922448"
## [5] "/for-sale/waterfront/cape-town/western-cape/9169/106075464"
## [6] "/for-sale/waterfront/cape-town/western-cape/9169/106104293"
paste("Number  of links/properties scraped:",length(house_links))
## [1] "Number  of links/properties scraped: 78"

We now read each of those pages and extract the following variables on each house:

  • Price
  • Erf size
  • Number of bedrooms
  • Number of bathrooms
  • Number of garages

as well as the ad text.

Note that, unfortunately, it's quite easy to get blocked (temporarily) by Property24 for making too many requests, so we only scrape the first 10 links here.

house_data <- data.frame()

for(i in house_links[1:10]){  
  
  # read house ad html
  house <- read_html(paste0('https://www.property24.com',i))
  
  # get the ad text 
  ad <- house %>% html_nodes(css = '.js_readMoreText') %>% html_text(trim = T)
  
  # get house data
  price <- house %>% html_nodes(css = '.p24_price') %>% html_text(trim = TRUE) %>% .[[2]]
  erfsize <- house %>% html_nodes(css = '.p24_size span') 
  nbeds <- house %>% html_nodes(css = '.p24_listingFeatures:nth-child(1) .p24_featureAmount') %>% html_text(trim = TRUE) %>% as.numeric()
  nbaths <- house %>% html_nodes(css = '.p24_listingFeatures:nth-child(2) .p24_featureAmount') %>% html_text(trim = TRUE) %>% as.numeric()
  ngar <- house %>% html_nodes(css = '.p24_listingFeatures:nth-child(3) .p24_featureAmount') %>% html_text(trim = TRUE) %>% as.numeric()
  
  # if couldn't find data on webpage, replace with NA
  price <- ifelse(length(price) > 0, price, NA)
  erfsize <- ifelse(length(erfsize) > 0, html_text(erfsize, trim = TRUE), NA)
  nbeds <- ifelse(length(nbeds) > 0, nbeds, NA)
  nbaths <- ifelse(length(nbaths) > 0, nbaths, NA)
  ngar <- ifelse(length(ngar) > 0, ngar, NA)
  
  # store results
  this_house <- data.frame(price = price, erfsize = erfsize, nbeds = nbeds, nbaths = nbaths, ngar = ngar, ad = ad)
  house_data <- rbind.data.frame(house_data,this_house)
  
  # See if random wait between 1 and 3 seconds avoids excessive requesting
  Sys.sleep(runif(1, 1, 3))
}

View the data (the first five columns).

house_data[1:5]
##           price erfsize nbeds nbaths ngar
## 1  R 13 500 000  146 m²     2    2.5    2
## 2  R 55 000 000  504 m²     3    4.0    3
## 3   R 8 995 000  110 m²     1    1.5    1
## 4  R 29 995 000  221 m²     3    3.5    2
## 5  R 27 594 250  280 m²     4    4.0    4
## 6  R 16 094 250  174 m²     3    2.0   NA
## 7   R 8 995 000  111 m²     2    2.5    2
## 8   R 9 500 000  115 m²     2    2.0    2
## 9   R 8 975 000  112 m²     2    2.5    2
## 10 R 14 500 000  152 m²     2    2.5    2

Nicely done. We were able to scrape data from Property24's site; specifically, data on all the listings for sale in Waterfront.


Example 4: Getting Movie Reviews

In this final example we retrieve movie reviews from IMDb, using movie links provided in a data set that we load below.

load('data/movielens-small.RData') #Contains 9742 movies
load('output/recommender.RData')   #Contains our subset of movies

# make into a tibble
links <- as_tibble(links)
head(links)
## # A tibble: 6 × 3
##   movieId imdbId tmdbId
##     <int>  <int>  <int>
## 1       1 114709    862
## 2       2 113497   8844
## 3       3 113228  15602
## 4       4 114885  31357
## 5       5 113041  11862
## 6       6 113277    949

The links data frame provides identifiers for each movie for three different movie data sets: MovieLens, IMDb, and The Movie Database. This gives us a way of looking up reviews for a particular movieId we are interested in on either IMDb or The Movie Database.

IMDb links are 7 characters long, so we need to add leading zeros in some cases.

links$imdbId <- sprintf('%07d',links$imdbId)

Let's extract just the movies that we used to build our recommender systems in the last lesson, and get the IMDb identifiers for those movies. While doing this, we also rename one of the Harry Potter movies.

imdbId_to_use <- distinct(ratings_red, movieId, .keep_all = T) %>% 
  select(movieId, title) %>% 
  inner_join(links, by = 'movieId') %>% 
  select(-title, title) %>% 
  arrange(imdbId) %>% 
  mutate(title = replace(title, str_detect(title, 'Potter.*Sorc'), 'Harry Potter and the Philosopher\'s Stone (2001)'))

imdbId_to_use %>% head(5)
## # A tibble: 5 × 4
##   movieId imdbId  tmdbId title                                      
##     <int> <chr>    <int> <chr>                                      
## 1     924 0062622     62 2001: A Space Odyssey (1968)               
## 2    1208 0078788     28 Apocalypse Now (1979)                      
## 3    1258 0081505    694 Shining, The (1980)                        
## 4    2115 0087469     87 Indiana Jones and the Temple of Doom (1984)
## 5    2918 0091042   9377 Ferris Bueller's Day Off (1986)

Next we need to know a little more about how reviews are displayed on IMDb. Only a certain number of reviews are displayed by default, with the option to "load more" at the bottom. To get the rest we would need to interact with the webpage, for which the RSelenium package is recommended. However, RSelenium can now only be run via Docker, which adds a level of complexity beyond a reasonable scope for this tutorial, so we will only scrape the visible reviews.

reviews <- data.frame()

# just get the first two movies to save time
for(j in 1:2){
  
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews?ref_=tt_ql3')
  movie_imdb <- read_html(link)
  
  # Used SelectorGadget as the CSS Selector
  imdb_review <- movie_imdb %>% html_nodes('.text.show-more__control') %>% html_text()
  
  this_review <- data.frame(imdbId = this_movie, review = imdb_review, stringsAsFactors = F)
  reviews <- rbind.data.frame(reviews, this_review)
}

reviews <- as_tibble(reviews)

We’ll now look in a bit more detail on working with text. Let’s look at the first review.

review1 <- as.character(reviews$review[1])
review1
## [1] "Sometimes reading the user comments on IMDB fills me with despair for the species.  For anybody to dismiss 2001: A Space Odyssey as \"boring\" they must have no interest in science, technology, philosophy, history or the art of film-making.  Finally I understand why most Hollywood productions are so shallow and vacuous - they understand their audience.Thankfully, those that cannot appreciate Kubrick's accomplishment are still a minority.  Most viewers are able to see the intelligence and sheer virtuosity that went into the making of this epic.  This is the film that put the science in \"science fiction\", and its depiction of space travel and mankind's future remains unsurpassed to this day.  It was so far ahead of its time that humanity still hasn't caught up.2001 is primarily a technical film.  The reason it is slow, and filled with minutae is because the aim was to realistically envision the future of technology (and the past, in the awe inspiring opening scenes).  The film's greatest strength is in the details.  Remember that when this film was made, man still hadn't made it out to the moon... but there it is in 2001, and that's just the start of the journey.  To create such an incredibly detailed vision of the future that 35 years later it is still the best we have is beyond belief - I still can't work out how some of the shots were done.  The film's only notable mistake was the optimism with which it predicted mankind's technological (and social) development.  It is our shame that the year 2001 did not look like the film 2001, not Kubrick's.Besides the incredible special effects, camera work and set design, Kubrick also presents the viewer with a lot of food for thought about what it means to be human, and where the human race is going.  Yes, the ending is weird and hard to comprehend - but that's the nature of the future.  Kubrick and Clarke have started the task of envisioning it, now it's up to the audience to continue.  
There's no neat resolution, no definitive full stop, because then the audience could stop thinking after the final reel.  I know that's what most audiences seem to want these days, but Kubrick isn't going to let us off so lightly.I'm glad to see that this film is in the IMDB top 100 films, and only wish that it were even higher.  Stanley Kubrick is one of the very finest film-makers the world has known, and 2001 his finest accomplishment. 10/10."

Although this is readable to us, we would like to clean it up a bit before we work with it. The first thing we can do is remove all the punctuation. We do this with a call to str_replace_all() and a 'regular expression', a way of describing patterns in strings. Here [:alnum:] refers to any alphanumeric character, equivalent to [A-Za-z0-9]. Inside the square brackets, ^ means negation, so we're removing anything that is not alphanumeric or a space (replacing it with nothing).

review1_nopunc <- str_replace_all(review1, '[^[:alnum:] ]', '')

Finally we can convert everything to lowercase. Note that there might still be some problems we'd like to fix up, most often where two words have been concatenated (e.g. 'lightlyim' near the end, where the punctuation separating two sentences was removed). Getting text totally clean can be hard work.

review1_clean <- tolower(review1_nopunc)
review1_clean
## [1] "sometimes reading the user comments on imdb fills me with despair for the species  for anybody to dismiss 2001 a space odyssey as boring they must have no interest in science technology philosophy history or the art of filmmaking  finally i understand why most hollywood productions are so shallow and vacuous  they understand their audiencethankfully those that cannot appreciate kubricks accomplishment are still a minority  most viewers are able to see the intelligence and sheer virtuosity that went into the making of this epic  this is the film that put the science in science fiction and its depiction of space travel and mankinds future remains unsurpassed to this day  it was so far ahead of its time that humanity still hasnt caught up2001 is primarily a technical film  the reason it is slow and filled with minutae is because the aim was to realistically envision the future of technology and the past in the awe inspiring opening scenes  the films greatest strength is in the details  remember that when this film was made man still hadnt made it out to the moon but there it is in 2001 and thats just the start of the journey  to create such an incredibly detailed vision of the future that 35 years later it is still the best we have is beyond belief  i still cant work out how some of the shots were done  the films only notable mistake was the optimism with which it predicted mankinds technological and social development  it is our shame that the year 2001 did not look like the film 2001 not kubricksbesides the incredible special effects camera work and set design kubrick also presents the viewer with a lot of food for thought about what it means to be human and where the human race is going  yes the ending is weird and hard to comprehend  but thats the nature of the future  kubrick and clarke have started the task of envisioning it now its up to the audience to continue  theres no neat resolution no definitive full stop because then the audience could stop 
thinking after the final reel  i know thats what most audiences seem to want these days but kubrick isnt going to let us off so lightlyim glad to see that this film is in the imdb top 100 films and only wish that it were even higher  stanley kubrick is one of the very finest filmmakers the world has known and 2001 his finest accomplishment 1010"

Messing Around and Exploring on IMDb

Let’s try now and scrape all the reviews from the IMDb site for Indiana Jones and the Temple of Doom (1984).

# Get info on movie
imdbId_to_use[which(imdbId_to_use$title == "Indiana Jones and the Temple of Doom (1984)"),]
## # A tibble: 1 × 4
##   movieId imdbId  tmdbId title                                      
##     <int> <chr>    <int> <chr>                                      
## 1    2115 0087469     87 Indiana Jones and the Temple of Doom (1984)
# initialise
reviews_indianajones <- data.frame()

# movie ID for IMDb
this_movie <- imdbId_to_use$imdbId[4]

# get url
link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews?ref_=tt_ql3')

# read html from url
movie_imdb <- read_html(link)
  
# CSS selector found with SelectorGadget
imdb_review <- movie_imdb %>% html_nodes('.text.show-more__control') %>% html_text()

# put all info and reviews into data frame  
this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
  
# combine review with other reviews
reviews_indianajones <- rbind.data.frame(reviews_indianajones, this_review)

# convert to tibble
reviews_indianajones <- as_tibble(reviews_indianajones)

reviews_indianajones %>% head(4)
## # A tibble: 4 × 2
##   imbdId  review                                                                
##   <chr>   <chr>                                                                 
## 1 0087469 "I know that there are a lot of haters when it comes to Indiana Jones…
## 2 0087469 "It's funny to call \"Indiana Jones and the Temple of Doom\" a follow…
## 3 0087469 "Indiana Jones and the Temple of Doom is the second of the Indy films…
## 4 0087469 "(re-Review): I've never disliked this movie, but it's also been a ha…

Let’s look at a specific review. Say review 13.

reviews_indianajones[13,]$review
## [1] "The adventurer and archaeologist Indiana Jones (Harrison Ford) with his bullwhip wielding and hat will fight against nasty enemies in India along with an oriental little boy (Jonathan Ke Quan) and a night club Singer (Kate Capshaw who married Steven Spielberg). Jones agrees with the village's inhabitants look for a lost magic stone. Meanwhile , they stumble with a secret thug cult ruled by an evil priest (Amrish Puri).The Indiana Jones adventures trilogy tries to continue the vintage pathes from the thirty years ago greatest classics , and the comics-books narrative , along with the special characteristics of the horror films of the 80s decade , as it is well reflected in the creepy and spooky scenes about human sacrifices . The picture is directed with great style and high adventure and driven along with enormous fair-play in the stunning mounted action set-pieces . Harrison Ford plays splendidly the valiant and brave archaeologist turned into an action man .Kate Capshaw interprets a scream girl who'll have a little romance with Indy . The movie blends adventures , noisy action , rip-snorting , humor , tongue-in-chek , it is a cinematic roller coaster ride and pretty bemusing . The motion picture has great loads of action , special effects galore and the usual and magnificent John Williams musical score . The glimmering and dazzling cinematography is efficiently realized by Douglas Slocombe . The pic was allrightly directed by Steven Spielberg. Film culminates in a spectacular finale that will have you on the edge of your seat . It's a must see for adventures aficionados , as perfect entertainment for all the family ."

How many numeric characters are contained in this review?

str_count(reviews_indianajones[13,]$review, "[0-9]")
## [1] 2
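Because `str_count()` is vectorised, we don't need to count one review at a time; the same pattern applied to the whole `review` column gives a digit count per review in one call. This is a small sketch, not part of the original analysis:

```r
library(stringr)

# str_count() is vectorised over its first argument, so this returns
# one digit count for every review in the tibble
digit_counts <- str_count(reviews_indianajones$review, "[0-9]")

# quick look at the distribution of digit counts
summary(digit_counts)
```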

Now let’s see if we can scrape all the reviews from the IMDb site for all 20 movies. We shall scrape the reviews that appear without clicking ‘Load More’.

reviews_all_movies <- data.frame()

# loop over all 20 movies
for(j in 1:20){
  
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews/?ref_=tt_ql_urv')
  movie_imdb <- read_html(link)
  
  # CSS selector found with SelectorGadget
  imdb_review <- movie_imdb %>% html_nodes('.show-more__control') %>% html_text()
  
  this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
  reviews_all_movies <- rbind.data.frame(reviews_all_movies, this_review)
}

reviews_all_movies <- as_tibble(reviews_all_movies)

reviews_all_movies %>% head(4)
## # A tibble: 4 × 2
##   imbdId  review                                                                
##   <chr>   <chr>                                                                 
## 1 0062622 "Sometimes reading the user comments on IMDB fills me with despair fo…
## 2 0062622 ""                                                                    
## 3 0062622 "\n                "                                                  
## 4 0062622 "A stand-alone monument in cinema history, Stanley Kubrick's magnum o…
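A loop like the one above aborts entirely if a single request fails (a timeout, a changed URL, a rate limit). One defensive pattern, sketched below and not part of the original code, wraps `read_html()` in `tryCatch()` so that a failed fetch skips that movie instead of killing the loop:

```r
library(rvest)

reviews_all_movies <- data.frame()

for (j in 1:20){
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt', this_movie, '/reviews/?ref_=tt_ql_urv')

  # return NULL instead of erroring if the page cannot be fetched
  movie_imdb <- tryCatch(read_html(link), error = function(e) NULL)
  if (is.null(movie_imdb)) next  # fetch failed; move on to the next movie

  imdb_review <- movie_imdb %>% html_nodes('.show-more__control') %>% html_text()
  this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
  reviews_all_movies <- rbind.data.frame(reviews_all_movies, this_review)
}
```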

Let’s investigate this now and remove the reviews that don’t look right, such as the empty and whitespace-only entries above. One way to do this is shown below:

# Initialise a df with clean reviews
cleaned_reviews <- data.frame()

# loop to clean reviews
for (i in 1:nrow(reviews_all_movies)){
  reviews_all_movies[i, "review"] <- reviews_all_movies[i, "review"] %>% str_trim()
  this_review <- reviews_all_movies[i, "review"]
  if (nchar(this_review) != 0){
    this_movie <- reviews_all_movies[i, 1]
    this_review <- data.frame(imbdId = this_movie, review = this_review, stringsAsFactors = F)
    cleaned_reviews <- rbind.data.frame(cleaned_reviews, this_review)
  }
}

# see result
#cleaned_reviews %>% head(4)

# did we get reviews for all 20 movies?
length(unique(cleaned_reviews$imbdId))
## [1] 20
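The row-by-row loop works, but the same cleaning can be expressed as a single vectorised pipeline with dplyr and stringr. This sketch should produce the same `cleaned_reviews` tibble:

```r
library(dplyr)
library(stringr)

cleaned_reviews <- reviews_all_movies %>%
  mutate(review = str_trim(review)) %>%  # strip leading/trailing whitespace
  filter(nchar(review) > 0)              # drop reviews that are now empty
```

Vectorised operations like these are both faster and easier to read than growing a data frame inside a loop with `rbind.data.frame()`.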

Having gotten all the reviews that appear without clicking ‘Load More’ for all 20 movies in our data set from IMDb, we shall now briefly explore them. First, the shortest review.

rev_chars <- reviews_all_movies 
rev_chars$num_chars <- rep(NA, nrow(rev_chars))

for (i in 1:nrow(rev_chars)){
  num_chars <- rev_chars[i,2] %>% str_length()
  rev_chars$num_chars[i] <- num_chars
}

# look at result
rev_chars %>% head(4)
## # A tibble: 4 × 3
##   imbdId  review                                                         num_c…¹
##   <chr>   <chr>                                                            <int>
## 1 0062622 "Sometimes reading the user comments on IMDB fills me with de…    2409
## 2 0062622 ""                                                                   0
## 3 0062622 ""                                                                   0
## 4 0062622 "A stand-alone monument in cinema history, Stanley Kubrick's …    9555
## # … with abbreviated variable name ¹​num_chars
# shortest review
rev_chars %>% select(imbdId, num_chars) %>% arrange(num_chars) %>% filter(num_chars>0)
## # A tibble: 497 × 2
##    imbdId  num_chars
##    <chr>       <int>
##  1 0258463        70
##  2 0087469        87
##  3 0407887       116
##  4 1049413       117
##  5 0129387       127
##  6 0110148       133
##  7 0078788       150
##  8 0129387       157
##  9 0108160       171
## 10 0091042       181
## # … with 487 more rows
# what movie?
imdbId_to_use %>% filter(imdbId == "0062622" | imdbId == "0407887")
## # A tibble: 2 × 4
##   movieId imdbId  tmdbId title                       
##     <int> <chr>    <int> <chr>                       
## 1     924 0062622     62 2001: A Space Odyssey (1968)
## 2   48516 0407887   1422 Departed, The (2006)
rev_chars %>% select(everything()) %>% arrange(num_chars) %>% filter(num_chars>0) %>% head(5)
## # A tibble: 5 × 3
##   imbdId  review                                                         num_c…¹
##   <chr>   <chr>                                                            <int>
## 1 0258463 I like the bit where he punched the bloke really hard in the …      70
## 2 0087469 ...is annoying as hell, but otherwise this is a very entertai…      87
## 3 0407887 Amazing performances, action, script, production, and directi…     116
## 4 1049413 It will put you through every emotion that a human can experi…     117
## 5 0129387 This movie is one of the funniest movies ever I have watched …     127
## # … with abbreviated variable name ¹​num_chars
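As with the cleaning step, the character-counting loop above can be collapsed into one `mutate()` call, since `str_length()` is vectorised. A sketch of the equivalent computation:

```r
library(dplyr)
library(stringr)

# add a character count per review in a single vectorised step
rev_chars <- reviews_all_movies %>%
  mutate(num_chars = str_length(review))

# shortest non-empty reviews first
rev_chars %>% filter(num_chars > 0) %>% arrange(num_chars)
```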

Each movie’s review page on IMDb states how many reviews have been posted for that movie. We shall scrape this for all 20 movies.

num_reviews_all_movies <- data.frame()

# loop over all 20 movies
for(j in 1:20){
  
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews/?ref_=tt_ql_urv')
  movie_imdb <- read_html(link)
  
  # CSS selector found with SelectorGadget
  imdb_review <- movie_imdb %>% html_nodes('.header span:nth-child(1)') %>% html_text()
  
  this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
  num_reviews_all_movies <- rbind.data.frame(num_reviews_all_movies, this_review)
}

num_reviews_all_movies <- as_tibble(num_reviews_all_movies)

num_reviews_all_movies %>% head(4)
## # A tibble: 4 × 2
##   imbdId  review       
##   <chr>   <chr>        
## 1 0062622 2,466 Reviews
## 2 0078788 1,370 Reviews
## 3 0081505 2,148 Reviews
## 4 0087469 779 Reviews

Now let’s keep only the digits (the number of reviews) by removing the surrounding text.

for (i in 1:nrow(num_reviews_all_movies)){
  num_reviews_all_movies[i,2] <- str_replace_all(num_reviews_all_movies[i,2], '\\D', '')
}
num_reviews_all_movies$review <- as.numeric(num_reviews_all_movies$review)
names(num_reviews_all_movies)[2] <- "num_reviews"
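As an alternative to the loop, `readr::parse_number()` extracts the number from a string in one call, handling both the surrounding text and the comma grouping mark. A sketch, assuming the readr package is installed:

```r
library(readr)

# parse_number() drops non-numeric text and grouping commas,
# so the raw scraped strings convert directly to numbers
parse_number("2,466 Reviews")
## [1] 2466
```

Applied to the whole scraped column, this replaces both the `str_replace_all()` loop and the `as.numeric()` conversion.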

Now that we have our tibble with the IMDb movie id and a column indicating the number of reviews each movie has, we can sort it to see which has the most.

# which movie has highest num reviews
num_reviews_all_movies %>% arrange(desc(num_reviews)) %>% head(4)
## # A tibble: 4 × 2
##   imbdId  num_reviews
##   <chr>         <dbl>
## 1 0407887        2525
## 2 0062622        2466
## 3 0246578        2440
## 4 0081505        2148
# what movie?
imdbId_to_use %>% filter(imdbId == "0407887")
## # A tibble: 1 × 4
##   movieId imdbId  tmdbId title               
##     <int> <chr>    <int> <chr>               
## 1   48516 0407887   1422 Departed, The (2006)

Finally, some of the movie reviews on IMDb also provide ratings out of 10. We shall scrape these for all 20 movies.

ratings_all_movies <- data.frame()

# loop over all 20 movies
for(j in 1:20){
  
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews/?ref_=tt_ql_urv')
  movie_imdb <- read_html(link)
  
  # CSS selector found with SelectorGadget
  imdb_review <- movie_imdb %>% html_nodes('.rating-other-user-rating span') %>% html_text()
  
  this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
  ratings_all_movies <- rbind.data.frame(ratings_all_movies, this_review)
}

ratings_all_movies <- as_tibble(ratings_all_movies)

ratings_all_movies %>% head(4)
## # A tibble: 4 × 2
##   imbdId  review
##   <chr>   <chr> 
## 1 0062622 10    
## 2 0062622 /10   
## 3 0062622 10    
## 4 0062622 /10

This is not exactly what we wanted: each score and its ‘/10’ denominator are scraped as separate rows. Let’s try to remove the unwanted rows.

new_rats <- data.frame()

# scores sit in the odd-numbered rows; the "/10" strings in the even ones
for (i in seq(from = 1, to = nrow(ratings_all_movies), by = 2)){
  new_rats <- rbind.data.frame(new_rats, ratings_all_movies[i,])
}
new_rats %>% head(4)
## # A tibble: 4 × 2
##   imbdId  review
##   <chr>   <chr> 
## 1 0062622 10    
## 2 0062622 10    
## 3 0062622 6     
## 4 0062622 7
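Taking every other row relies on the score and the ‘/10’ denominator alternating perfectly; if a review lacked a rating, the pairing would break. A more direct sketch (not the original approach) filters out the denominator rows explicitly:

```r
library(dplyr)

# keep only the rows holding an actual score, dropping the "/10" rows
new_rats <- ratings_all_movies %>% filter(review != "/10")
```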

Lovely. Now we can investigate which movie has the highest average rating and what that average rating is.

# change to numeric
new_rats$review <- as.numeric(new_rats$review)

# find mean and sort
new_rats %>% group_by(imbdId) %>% summarise(mean = mean(review, na.rm = TRUE), n = n()) %>% arrange(desc(mean)) %>% head(4)
## # A tibble: 4 × 3
##   imbdId   mean     n
##   <chr>   <dbl> <int>
## 1 0120689  9.61    23
## 2 1049413  9.55    22
## 3 0078788  9.33    21
## 4 0258463  8.5     24
# find movie title
imdbId_to_use %>% filter(imdbId=="0120689")
## # A tibble: 1 × 4
##   movieId imdbId  tmdbId title                 
##     <int> <chr>    <int> <chr>                 
## 1    3147 0120689    497 Green Mile, The (1999)

Example 5: Quick Tutorial with Tables

Click here for a great walkthrough of using web scraping for tables.


A Note on Web Scraping

Big websites, like Google or Amazon, are designed to handle high traffic. Smaller sites are not. It’s therefore important that you don’t overload a site with too many HTTP requests, which can slow it down, or even crash it completely. In fact, this is a technique often used by hackers. They flood sites with requests to bring them down, in what’s known as a ‘denial of service’ attack. Make sure you don’t carry one of these out by mistake! Don’t scrape too aggressively, either; include plenty of time intervals between requests, and avoid scraping a site during its peak hours.
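In R, the simplest way to space out requests is `Sys.sleep()` inside the scraping loop. The sketch below is illustrative only; the two-second pause is an assumption, not a recommendation from any particular site:

```r
library(rvest)

# hypothetical list of pages to fetch
urls <- paste0('http://www.imdb.com/title/tt', imdbId_to_use$imdbId, '/reviews')

for (link in urls){
  page <- read_html(link)
  # ... extract whatever you need from `page` ...
  Sys.sleep(2)  # pause between requests so we don't overload the server
}
```

For a more principled approach, the `polite` package wraps rvest with rate limiting and checks each site’s robots.txt before scraping.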

Web scraping also brings with it some ethical concerns. It’s important to read about these and formulate your own opinion and approach, starting for example here, here, and here.


Additional Work

We can try to apply what we’ve learned on the following tasks.

  1. The Freakonomics Radio Archive contains all previous Freakonomics podcasts. Scrape the titles, dates and descriptions, and download URLs of all the podcasts and store them in a dataframe (see if you can download all the medically-themed podcasts).

  2. Decanter magazine provides one of the world’s best known wine ratings. Scrape the tasting notes, scores, and prices for their South African white wines (or whatever subset you choose).