Abstract
A tutorial introducing web scraping methods. We cover
the basic uses of the rvest package as well as applications
to different web scraping examples. We also introduce the
SelectorGadget tool for your browser.
The internet is rich with data sets that you can use for your own personal projects. Often you won't be able to ask for the data directly or have access to it in a neat format. That is, while data has many sources, its biggest repository is the web. When this happens, we need to turn to web scraping, a technique where we get the data we want to analyse by finding it in a website's HTML code.
Web scraping is the process of extracting data from websites. More succinctly, it is the process of automatically extracting content and data from a website.
This is achieved by actually extracting underlying HTML code and, with it, data stored in a database. It can be done manually, but typically when we talk of web scraping we mean gathering data from websites by automated means.
Web scraping involves two distinct processes: fetching or
downloading the web page, and extracting data
from it. In this tutorial we introduce the rvest package,
which provides various web scraping functions. We also introduce the
SelectorGadget tool and show how to use it to identify the
parts of a webpage we want. Finally, we apply these tools to two
different scraping examples: scraping property data and movie
reviews.
We shall be using examples throughout the tutorial based on the work of Ian Durbach; his profile is found here. This tutorial draws on DataQuest's article on web scraping in R, as well as the R for Data Science book, particularly Chapter 14 on Strings. Much of the material follows the Data Science for Industry course from the University of Cape Town. Other great resources used for this tutorial are shown below:
Web Scraping with R by Steve Pittard
Theoretical background on scraping from CareerFoundry and Imperva.
Web scraping involves working with HTML files, the language used to construct web pages. We introduce bits and pieces of HTML as needed, but do not cover these from first principles or in great detail. There is a nice basic introduction to HTML here. I don’t believe it is entirely necessary to be an HTML wizard to be able to scrape data effectively.
Before we can start learning how to scrape a web page, we need to understand how a web page itself is structured. The main languages used to build web pages are Hypertext Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript. HTML gives a web page its actual structure and content. CSS gives a web page its style and look, including details like fonts and colors. JavaScript gives a web page functionality. It is useful to have a rough idea of how everything fits together with respect to HTML and a website's underlying operation, which is summarised below:
Websites are written using HTML (Hypertext Markup Language), a markup language. A web page is basically an HTML file. An HTML file is a plain-text file in which the text is written using the HTML language, i.e. it contains HTML commands, content, etc. HTML files can be linked to one another, which is how a website is put together.
An HTML file, and hence a web page, consists of two main parts:
HTML tags and content. HTML tags are the parts of a web
page that define how content is formatted and displayed in a web
browser. It's easiest to explain with a small example. Below is a minimal
HTML file: the tags are the commands within angle brackets,
e.g. <head>. Notice that the word "head" is
surrounded by <> brackets, which indicates that it is
a tag. Try copying the text below into a text editor, saving it as .html, and
opening it in your browser. Tags can be customised with tag
attributes. For more information on HTML tags and other elements,
look here.
Also notice that the tags come in pairs: each opening tag is
accompanied by a closing tag with the same name. For example, the
opening <html> tag is paired with a closing tag
</html>; together they indicate the beginning and end of the
HTML document.
<html>
<head>
<title>A simple webpage</title>
</head>
<body>
Some content. More <b>very important</b> content.
</body>
</html>
CSS stands for Cascading Style Sheets, a 'style sheet
language': a language that controls how certain kinds of documents
are presented. CSS is a style sheet language for markup documents
like those written using HTML. Style sheets define things like the
colour and layout of text and other HTML elements. 'Styling' here
covers a wide range of things, from the colour of particular HTML
elements to their positioning. Like HTML, the scope of CSS is so
large that we can't cover every concept in the language. If
you're interested, you can learn more here.
Separating presentation from content is often useful e.g. multiple HTML
pages can share formatting through a shared CSS (.css)
file.
A CSS file is written as a set of rules. Each rule consists of a selector and a declaration. The CSS selector points to the HTML element the declaration refers to. The declaration contains instructions about how the HTML element identified by the CSS selector should be presented. CSS selectors identify HTML elements by matching tags and tag attributes. There’s a fun tutorial on CSS selectors here.
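As a small illustration of selectors matching tags and classes, here is a sketch using rvest (introduced in more detail later) on a made-up HTML fragment:

```r
library(rvest)

# A made-up HTML fragment, just for illustration
page <- read_html('<html><body>
  <p class="intro">Welcome!</p>
  <p>Some other text.</p>
</body></html>')

# A tag selector matches every <p> element
html_text(html_nodes(page, 'p'))
# [1] "Welcome!"         "Some other text."

# A class selector matches only elements with class="intro"
html_text(html_nodes(page, '.intro'))
# [1] "Welcome!"
```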
If there’s data on a website, then in theory, it’s scrapable! Common data types organizations collect include images, videos, text, product information, customer sentiments and reviews (on sites like Twitter, Yell, or TripAdvisor), and pricing from comparison websites.
Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include:
Price comparison sites deploying bots to auto-fetch prices and product descriptions from associated seller websites.
Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).
Companies like Amazon or eBay scraping data from product sites to support competitor analysis.
Google regularly using web scraping to analyze, rank, and index web content.
Web scraping invariably involves copying data, so copyright issues are often involved, and there are legal limits on what types of information you can scrape. Web scraping also has a dark underbelly: it is used for illegal purposes such as undercutting prices and stealing copyrighted content, and bad actors scrape data like bank details or other personal information to conduct fraud, scams, intellectual property theft, and extortion.
Automated web scraping software can process data much more quickly than manual web users, placing a strain on host web servers. Scraping may also be against the terms of service of some websites. The bottom line is that the ethics of web scraping is not straightforward, and is evolving. There is lots of useful information on the web about these issues, for example here, here, and here.
rvest Package
The rvest package, maintained by Hadley Wickham, lets users easily scrape ("harvest") data from web pages.
The rvest package is one of the tidyverse libraries, so
it works well with the other libraries contained in the bundle. It takes
inspiration from the Python web scraping library BeautifulSoup
- more details and a tutorial can be found here.
The package has several key functions used to scrape data from the web:
read_html(): reads the HTML code from a URL (or file, or string) into R.
html_nodes(): extracts the content and tags matching specified CSS selectors.
html_text(): extracts the text content from nodes.
html_attr(): extracts a named attribute from nodes, such as the href of a hyperlink.
Also, if the page contains tabular data, you can convert it directly
to a data frame with html_table(), which extracts tables from
nodes.
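To see these functions in action before scraping a live site, here is a toy example on a made-up inline HTML string (the page content is invented for illustration):

```r
library(rvest)

# A made-up page with a heading, a link, and a small table
toy <- read_html('<html><body>
  <h1>Fruit prices</h1>
  <a href="https://example.com/fruit">source</a>
  <table>
    <tr><th>Fruit</th><th>Price</th></tr>
    <tr><td>Apple</td><td>3</td></tr>
    <tr><td>Pear</td><td>5</td></tr>
  </table>
</body></html>')

# html_nodes() + html_text(): the heading text
html_text(html_nodes(toy, 'h1'))
# [1] "Fruit prices"

# html_attr(): the hyperlink target of the <a> tag
html_attr(html_nodes(toy, 'a'), 'href')
# [1] "https://example.com/fruit"

# html_table(): the <table> element as a data frame
html_table(html_nodes(toy, 'table')[[1]])
```

The same four steps (read, select, then extract text, attributes, or tables) apply to any real webpage.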
There are a number of tools that allow us to inspect web pages and see “what is under the hood”. We shall be looking at SelectorGadget.
Scraping involves more than simply executing code and hoping for the best. The exact method for carrying out these steps depends on the tools you're using; we cover one such approach below, but for now we describe the (non-technical) basics.
Figure out which website(s) you want to scrape.
Before coding your web scraper, you need to identify what it has to
scrape. Right-clicking anywhere on the frontend of a website gives you
the option to ‘inspect element’ or ‘view page source.’ This reveals the
site’s back-end code, which is what the scraper will read. We shall also
show how to use SelectorGadget tool for this purpose.
Once you've found the appropriate nested tags, you'll need to
incorporate these into your scraping code. We shall use the
rvest package, using its functions to
tell it where to look and what to extract. When you’re coding your web
scraper, it’s important to be as specific as possible about what you
want to collect.
Once you’ve written the code, the next step is to execute it. The scraper requests site access, extracts the data, and parses it.
After extracting, parsing, and collecting the relevant data, you’ll need to store it. You can instruct your algorithm to do this by adding extra lines to your code.
We also make use of regular expressions to extract a neater and easier-to-read set of data.
rvest
There are several steps involved in using rvest which are conceptually quite straightforward:
1. Identify a URL to be examined for content.
2. Use SelectorGadget, XPath, or Google Chrome's Inspect tool to identify the "selector". This will be a paragraph, table, hyperlink, image, etc.
3. Load rvest.
4. Use read_html to "read" the URL.
5. Pass the result to html_nodes to get the selectors identified in step 2.
6. Get the text or table content.
First we load the packages we’ll need in this workbook.
library(rvest)
library(tidyverse)
library(stringr)
This example was directly taken from Ian Durbach’s work. We’ll use the SelectorGadget tool to find the CSS selectors for headlines on the Daily Maverick website. Then we’ll use the rvest package to scrape the headings and save them as strings in R.
First, make sure you’ve got the SelectorGadget tool available in your web browser’s toolbar. Go to http://selectorgadget.com/ and find the link that says ‘drag this link to your bookmark bar’. You only need to do this once.
Now let’s visit the Daily
Maverick webpage. Click on the SelectorGadget tool and identify the
CSS selectors for headlines. It should just be h1, although
this may change with time, and will likely be different for different
news sites. Another way of identifying specific elements on a web page
is to open the element inspector.
Finally, let's switch over to R and scrape the headlines. We first
read in the webpage using read_html. This simply reads in
an HTML document, which can come from a URL, a file on disk, or a string.
It returns an XML document (XML being another markup language).
dm_page <- read_html('https://www.dailymaverick.co.za/')
dm_page
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="home page-template-default page page-id-1057736 wp-custom-lo ...
We extract relevant information from the document with
html_nodes. This returns a set of XML element nodes, each
one containing the tag and contents (e.g. text) associated with the
specified CSS selectors:
dm_elements <- html_nodes(x = dm_page, css = 'h1')
dm_elements[1:5]
## {xml_nodeset (5)}
## [1] <h1><a target="_self" href="/" title="Daily Maverick"><img class="img-res ...
## [2] <h1>Spree of arrests of top cops exposes the depth of the rot alleged in ...
## [3] <h1>What happens to the money lost to corruption after the state claws it ...
## [4] <h1>Spanish federation backs women’s soccer manager as players quit natio ...
## [5] <h1>SA signs agreements with three independent power producers, but is it ...
To get just the text inside the element nodes we use
html_text, with trim = TRUE to clean up white
space characters.
dm_text <- html_text(dm_elements, trim = TRUE)
as_tibble(dm_text) %>% head(5)
## # A tibble: 5 × 1
## value
## <chr>
## 1 ""
## 2 "Spree of arrests of top cops exposes the depth of the rot alleged in SAPS’s …
## 3 "What happens to the money lost to corruption after the state claws it back?"
## 4 "Spanish federation backs women’s soccer manager as players quit national tea…
## 5 "SA signs agreements with three independent power producers, but is it too li…
If the resulting table contains some content we don't want, we can
clean up the text later. For now this suffices as a simple example of
using the SelectorGadget tool and the basic functions in
rvest.
One especially useful form of scraping is getting tables containing data from websites. This example shows you how to do that.
We’ll use the now well-known table on the worldometers
webpage, containing the latest available coronavirus data. Before
running the code below, visit the webpage and use SelectorGadget to
identify the CSS selector you need. For this illustration, we will
select all elements corresponding to table.
First, read the webpage as before:
covid_page <- read_html('https://www.worldometers.info/coronavirus/')
Extract the table element(s) with html_nodes().
covid_elements <- html_nodes(covid_page, 'table')
View the extracted elements. Say we want yesterday’s table, to extract the daily increases in infections and deaths.
covid_elements
## {xml_nodeset (3)}
## [1] <table id="main_table_countries_today" class="table table-bordered table- ...
## [2] <table id="main_table_countries_yesterday" class="table table-bordered ta ...
## [3] <table id="main_table_countries_yesterday2" class="table table-bordered t ...
Use html_table() to extract the tables inside the second
element of covid_elements. Remember we select the second
element, since we want data from yesterday’s table.
covid_table <- html_table(covid_elements[[2]])
head(covid_table[,1:6], 3)
## # A tibble: 3 × 6
## `#` `Country,Other` TotalCases NewCases TotalDeaths NewDeaths
## <int> <chr> <chr> <chr> <chr> <int>
## 1 NA Asia 188,953,762 +155,476 1,478,239 319
## 2 NA North America 116,104,166 +16,250 1,536,722 87
## 3 NA Europe 225,899,773 +135,605 1,917,574 213
Wait, is this only per continent?
covid_table[1:20,1:6]
## # A tibble: 20 × 6
## `#` `Country,Other` TotalCases NewCases TotalDeaths NewDeaths
## <int> <chr> <chr> <chr> <chr> <int>
## 1 NA "Asia" 188,953,762 "+155,476" 1,478,239 319
## 2 NA "North America" 116,104,166 "+16,250" 1,536,722 87
## 3 NA "Europe" 225,899,773 "+135,605" 1,917,574 213
## 4 NA "South America" 64,030,870 "+12,053" 1,329,045 80
## 5 NA "Oceania" 12,346,720 "+1,139" 20,621 NA
## 6 NA "Africa" 12,641,789 "+178" 257,592 NA
## 7 NA "" 721 "" 15 NA
## 8 NA "World" 619,977,801 "+320,701" 6,539,808 699
## 9 1 "China" 249,172 "+188" 5,226 NA
## 10 2 "USA" 97,895,860 "+15,377" 1,081,708 75
## 11 3 "India" 44,568,114 "+4,777" 528,510 23
## 12 4 "France" 35,125,681 "+38,024" 154,887 NA
## 13 5 "Brazil" 34,673,221 "+6,834" 685,837 21
## 14 6 "Germany" 32,952,050 "" 149,458 NA
## 15 7 "S. Korea" 24,594,336 "+29,315" 28,140 63
## 16 8 "UK" 23,621,952 "" 189,919 NA
## 17 9 "Italy" 22,284,812 "+22,360" 176,867 43
## 18 10 "Japan" 20,982,896 "+64,053" 44,262 85
## 19 11 "Russia" 20,746,163 "+51,269" 386,662 111
## 20 12 "Turkey" 16,873,793 "" 101,139 NA
No, although note that China is (at time of writing) listed as the first country, with the rest ordered according to total cases. It is always a good idea to, if possible, check your results against the website. Also bear in mind that webpages are dynamic and can yield different results over time.
We can also use the pipe operator to string all these commands
together. Note the use of .[[i]], which is the operation
‘extract the i-th element’.
covid_table_piped <- read_html('https://www.worldometers.info/coronavirus/') %>% html_nodes('table') %>% .[[2]] %>% html_table()
head(covid_table_piped[,1:6], 4)
## # A tibble: 4 × 6
## `#` `Country,Other` TotalCases NewCases TotalDeaths NewDeaths
## <int> <chr> <chr> <chr> <chr> <int>
## 1 NA Asia 188,953,762 +155,476 1,478,239 319
## 2 NA North America 116,104,166 +16,250 1,536,722 87
## 3 NA Europe 225,899,773 +135,605 1,917,574 213
## 4 NA South America 64,030,870 +12,053 1,329,045 80
We shall now move onto more complex examples.
This is a more advanced example where we scrape data on houses for sale in a particular area of interest off the Property24 website.
The landing page for a suburb shows summaries for the first 20
houses. At the bottom of the page are links to further pages, each
containing 20 house summaries. First we read in the landing page and
identify all hyperlinks on that page. Links are all identified
by CSS selector a. We want to extract the hypertext
reference (href).
# if the link below doesn't work, it may be out-of-date... just find another
suburb <- read_html('https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169')
suburb_links <- suburb %>% html_nodes('a') %>% html_attr('href')
print(suburb_links[1:6])
## [1] "https://www.microsoft.com/en-us/edge"
## [2] "/"
## [3] "/"
## [4] "/for-sale/waterfront/cape-town/western-cape/9169"
## [5] "/commercial-property-for-sale/waterfront/cape-town/western-cape/9169"
## [6] "/for-sale/waterfront/cape-town/western-cape/9169?sp=r%3dTrue"
Next, we need to identify just those hyperlinks that load pages with house summaries (let’s call these ‘summary pages’). We do this by matching pattern with regular expressions. A regular expression is a sequence of characters that specifies a search pattern in text. We shall dive into more details regarding regular expressions at a later stage.
We use the str_subset function to look for specific
patterns in the URLs: specifically, full URLs that contain for-sale followed by the suburb code 9169.
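As a toy illustration (with made-up URLs), str_subset keeps only the strings that match a regular expression:

```r
library(stringr)

# made-up URLs for illustration
urls <- c('https://example.com/for-sale/suburb/9169',
          'https://example.com/contact-us',
          'https://example.com/for-sale/suburb/9169/p2')

# keep URLs containing 'http', then 'for-sale', then '9169', in that order
str_subset(urls, '(http).*(for-sale).*(9169)')
# [1] "https://example.com/for-sale/suburb/9169"
# [2] "https://example.com/for-sale/suburb/9169/p2"
```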
suburb_pages <- str_subset(suburb_links,'(http).*(for-sale).*(9169)')
suburb_pages
## [1] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169/p2"
## [2] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169"
## [3] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169/p2"
## [4] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169/p3"
## [5] "https://www.property24.com/for-sale/waterfront/cape-town/western-cape/9169/p4"
We see duplicates, since the "Next" button and the "Page 2" link point to
the same URL. We can remove duplicate links using the unique command.
In this case, the website provides only 4 pages of houses for sale in
Waterfront, Cape Town.
If we see that there are more pages of interest than could have been accessed through that initial page, we could manually fill in the gaps in our list of suburb pages, although a good way would be to generalise and automate this next bit.
This is shown below. Say we wanted pages 7 to 14 as well.
suburb_pages <- sort(unique(c(suburb_pages, paste0(suburb_pages[2], '/p', 7:14))))
suburb_pages
At this stage we have URLs for all the Property24 pages for Waterfront in Cape Town. Now, for each of the summary pages, we extract the hyperlinks that lead to the full house ads, cycling through each page in turn.
house_links <- c()
for(i in suburb_pages){
suburb_i <- read_html(i)
suburb_i_links <- suburb_i %>% html_nodes('a') %>% html_attr('href')
house_links_i <- str_subset(suburb_i_links,'(for-sale).*(9169/)[0-9]{9}$')
house_links <- c(house_links, house_links_i)
}
# remove any duplicates and reorder
house_links <- sort(unique(house_links))
Now let's take a look at the links we got, and at how many there are (i.e. how many properties).
house_links[1:6]
## [1] "/for-sale/waterfront/cape-town/western-cape/9169/102066306"
## [2] "/for-sale/waterfront/cape-town/western-cape/9169/105419812"
## [3] "/for-sale/waterfront/cape-town/western-cape/9169/105827693"
## [4] "/for-sale/waterfront/cape-town/western-cape/9169/105922448"
## [5] "/for-sale/waterfront/cape-town/western-cape/9169/106075464"
## [6] "/for-sale/waterfront/cape-town/western-cape/9169/106104293"
paste("Number of links/properties scraped:",length(house_links))
## [1] "Number of links/properties scraped: 78"
We now read each of those pages and extract the following variables for each house: the price, erf size, and number of bedrooms, bathrooms, and garages, as well as the ad text.
Note that, unfortunately, it's quite easy to get blocked (temporarily) by Property24 for making too many requests, so we just do 10 houses here as an example.
house_data <- data.frame()
for(i in house_links[1:10]){
# read house ad html
house <- read_html(paste0('https://www.property24.com',i))
# get the ad text
ad <- house %>% html_nodes(css = '.js_readMoreText') %>% html_text(trim = T)
# get house data
price <- house %>% html_nodes(css = '.p24_price') %>% html_text(trim = TRUE) %>% .[[2]]
erfsize <- house %>% html_nodes(css = '.p24_size span')
nbeds <- house %>% html_nodes(css = '.p24_listingFeatures:nth-child(1) .p24_featureAmount') %>% html_text(trim = TRUE) %>% as.numeric()
nbaths <- house %>% html_nodes(css = '.p24_listingFeatures:nth-child(2) .p24_featureAmount') %>% html_text(trim = TRUE) %>% as.numeric()
ngar <- house %>% html_nodes(css = '.p24_listingFeatures:nth-child(3) .p24_featureAmount') %>% html_text(trim = TRUE) %>% as.numeric()
# if couldn't find data on webpage, replace with NA
price <- ifelse(length(price) > 0, price, NA)
erfsize <- ifelse(length(erfsize) > 0, html_text(erfsize, trim = TRUE), NA)
nbeds <- ifelse(length(nbeds) > 0, nbeds, NA)
nbaths <- ifelse(length(nbaths) > 0, nbaths, NA)
ngar <- ifelse(length(ngar) > 0, ngar, NA)
# store results
this_house <- data.frame(price = price, erfsize = erfsize, nbeds = nbeds, nbaths = nbaths, ngar = ngar, ad = ad)
house_data <- rbind.data.frame(house_data,this_house)
# See if random wait between 1 and 3 seconds avoids excessive requesting
Sys.sleep(runif(1, 1, 3))
}
View the data (the first five columns, leaving out the long ad text).
house_data[1:5]
## price erfsize nbeds nbaths ngar
## 1 R 13 500 000 146 m² 2 2.5 2
## 2 R 55 000 000 504 m² 3 4.0 3
## 3 R 8 995 000 110 m² 1 1.5 1
## 4 R 29 995 000 221 m² 3 3.5 2
## 5 R 27 594 250 280 m² 4 4.0 4
## 6 R 16 094 250 174 m² 3 2.0 NA
## 7 R 8 995 000 111 m² 2 2.5 2
## 8 R 9 500 000 115 m² 2 2.0 2
## 9 R 8 975 000 112 m² 2 2.5 2
## 10 R 14 500 000 152 m² 2 2.5 2
Nicely done. We were able to scrape data from Property24's site; specifically, the data for the listings for sale in Waterfront.
In this final example we retrieve movie reviews for certain movies, using links provided to us by loading a data set.
load('data/movielens-small.RData') #Contains 9742 movies
load('output/recommender.RData') #Contains our subset of movies
# make into a tibble
links <- as_tibble(links)
head(links)
## # A tibble: 6 × 3
## movieId imdbId tmdbId
## <int> <int> <int>
## 1 1 114709 862
## 2 2 113497 8844
## 3 3 113228 15602
## 4 4 114885 31357
## 5 5 113041 11862
## 6 6 113277 949
The links data frame provides identifiers for each movie for three different movie data sets: MovieLens, IMDb, and The Movie Database. This gives us a way of looking up reviews for a particular movieId we are interested in on either IMDb or The Movie Database.
IMDb links are 7 characters long, so we need to add leading zeros in some cases.
links$imdbId <- sprintf('%07d',links$imdbId)
Let's extract just the movies that we used to build our recommender systems in the last lesson, and get the IMDb identifiers for those movies. While doing this, we also rename one of the Harry Potter movies.
imdbId_to_use <- distinct(ratings_red, movieId, .keep_all = T) %>%
select(movieId, title) %>%
inner_join(links, by = 'movieId') %>%
select(-title, title) %>%
arrange(imdbId) %>%
mutate(title = replace(title, str_detect(title, 'Potter.*Sorc'), 'Harry Potter and the Philosopher\'s Stone (2001)'))
imdbId_to_use %>% head(5)
## # A tibble: 5 × 4
## movieId imdbId tmdbId title
## <int> <chr> <int> <chr>
## 1 924 0062622 62 2001: A Space Odyssey (1968)
## 2 1208 0078788 28 Apocalypse Now (1979)
## 3 1258 0081505 694 Shining, The (1980)
## 4 2115 0087469 87 Indiana Jones and the Temple of Doom (1984)
## 5 2918 0091042 9377 Ferris Bueller's Day Off (1986)
Next we need to know a little more about how reviews are displayed on
IMDb. We see that only a certain number of reviews are displayed by
default, with the option to “load more” at the bottom. At this point we
need to interact with the webpage, for which the RSelenium
package is still recommended. However, RSelenium can now
only be run via Docker, which adds a whole new level of complexity that
takes us beyond a reasonable scope for this tutorial. Therefore, we will
only be scraping the visible reviews.
reviews <- data.frame()
# just get the first two movies to save time
for(j in 1:2){
this_movie <- imdbId_to_use$imdbId[j]
link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews?ref_=tt_ql3')
movie_imdb <- read_html(link)
# Used SelectorGadget as the CSS Selector
imdb_review <- movie_imdb %>% html_nodes('.text.show-more__control') %>% html_text()
this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
reviews <- rbind.data.frame(reviews, this_review)
}
reviews <- as_tibble(reviews)
We'll now look in a bit more detail at working with text. Let's look at the first review.
review1 <- as.character(reviews$review[1])
review1
## [1] "Sometimes reading the user comments on IMDB fills me with despair for the species. For anybody to dismiss 2001: A Space Odyssey as \"boring\" they must have no interest in science, technology, philosophy, history or the art of film-making. Finally I understand why most Hollywood productions are so shallow and vacuous - they understand their audience.Thankfully, those that cannot appreciate Kubrick's accomplishment are still a minority. Most viewers are able to see the intelligence and sheer virtuosity that went into the making of this epic. This is the film that put the science in \"science fiction\", and its depiction of space travel and mankind's future remains unsurpassed to this day. It was so far ahead of its time that humanity still hasn't caught up.2001 is primarily a technical film. The reason it is slow, and filled with minutae is because the aim was to realistically envision the future of technology (and the past, in the awe inspiring opening scenes). The film's greatest strength is in the details. Remember that when this film was made, man still hadn't made it out to the moon... but there it is in 2001, and that's just the start of the journey. To create such an incredibly detailed vision of the future that 35 years later it is still the best we have is beyond belief - I still can't work out how some of the shots were done. The film's only notable mistake was the optimism with which it predicted mankind's technological (and social) development. It is our shame that the year 2001 did not look like the film 2001, not Kubrick's.Besides the incredible special effects, camera work and set design, Kubrick also presents the viewer with a lot of food for thought about what it means to be human, and where the human race is going. Yes, the ending is weird and hard to comprehend - but that's the nature of the future. Kubrick and Clarke have started the task of envisioning it, now it's up to the audience to continue. 
There's no neat resolution, no definitive full stop, because then the audience could stop thinking after the final reel. I know that's what most audiences seem to want these days, but Kubrick isn't going to let us off so lightly.I'm glad to see that this film is in the IMDB top 100 films, and only wish that it were even higher. Stanley Kubrick is one of the very finest film-makers the world has known, and 2001 his finest accomplishment. 10/10."
Although this is readable to us, we would like to clean it up a bit
before we work with it. The first thing we can do is remove all the
punctuation. We do this with a call to str_replace_all()
and a ‘regular expression’, a way of describing patterns in strings.
Here [:alnum:] refers to any alphanumeric character,
equivalent to [A-Za-z0-9]. Within the square brackets, ^
means negation, so we're removing anything that's not alphanumeric
or a space (replacing it with nothing).
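A toy example of the same substitution on a short made-up string:

```r
library(stringr)

# remove everything that is not a letter, digit, or space
str_replace_all("Don't panic! It's 2001.", '[^[:alnum:] ]', '')
# [1] "Dont panic Its 2001"
```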
review1_nopunc <- str_replace_all(review1, '[^[:alnum:] ]', '')
Finally we can convert everything to lowercase. Note that there might still be some problems we'd like to fix up, most often when two words get concatenated (e.g. 'audiencethankfully' near the start). Getting text totally clean can be hard work.
review1_clean <- tolower(review1_nopunc)
review1_clean
## [1] "sometimes reading the user comments on imdb fills me with despair for the species for anybody to dismiss 2001 a space odyssey as boring they must have no interest in science technology philosophy history or the art of filmmaking finally i understand why most hollywood productions are so shallow and vacuous they understand their audiencethankfully those that cannot appreciate kubricks accomplishment are still a minority most viewers are able to see the intelligence and sheer virtuosity that went into the making of this epic this is the film that put the science in science fiction and its depiction of space travel and mankinds future remains unsurpassed to this day it was so far ahead of its time that humanity still hasnt caught up2001 is primarily a technical film the reason it is slow and filled with minutae is because the aim was to realistically envision the future of technology and the past in the awe inspiring opening scenes the films greatest strength is in the details remember that when this film was made man still hadnt made it out to the moon but there it is in 2001 and thats just the start of the journey to create such an incredibly detailed vision of the future that 35 years later it is still the best we have is beyond belief i still cant work out how some of the shots were done the films only notable mistake was the optimism with which it predicted mankinds technological and social development it is our shame that the year 2001 did not look like the film 2001 not kubricksbesides the incredible special effects camera work and set design kubrick also presents the viewer with a lot of food for thought about what it means to be human and where the human race is going yes the ending is weird and hard to comprehend but thats the nature of the future kubrick and clarke have started the task of envisioning it now its up to the audience to continue theres no neat resolution no definitive full stop because then the audience could stop thinking after the 
final reel i know thats what most audiences seem to want these days but kubrick isnt going to let us off so lightlyim glad to see that this film is in the imdb top 100 films and only wish that it were even higher stanley kubrick is one of the very finest filmmakers the world has known and 2001 his finest accomplishment 1010"
Let’s try now and scrape all the reviews from the IMDb site for Indiana Jones and the Temple of Doom (1984).
# Get info on movie
imdbId_to_use[which(imdbId_to_use$title == "Indiana Jones and the Temple of Doom (1984)"),]
## # A tibble: 1 × 4
## movieId imdbId tmdbId title
## <int> <chr> <int> <chr>
## 1 2115 0087469 87 Indiana Jones and the Temple of Doom (1984)
# initialise
reviews_indianajones <- data.frame()
# movie ID for IMDb
this_movie <- imdbId_to_use$imdbId[4]
# get url
link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews?ref_=tt_ql3')
# read html from url
movie_imdb <- read_html(link)
# Used SelectorGadget as the CSS Selector
imdb_review <- movie_imdb %>% html_nodes('.text.show-more__control') %>% html_text()
# put all info and reviews into data frame
this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
# combine review with other reviews
reviews_indianajones <- rbind.data.frame(reviews_indianajones, this_review)
# convert to tibble
reviews_indianajones <- as_tibble(reviews_indianajones)
reviews_indianajones %>% head(4)
## # A tibble: 4 × 2
## imbdId review
## <chr> <chr>
## 1 0087469 "I know that there are a lot of haters when it comes to Indiana Jones…
## 2 0087469 "It's funny to call \"Indiana Jones and the Temple of Doom\" a follow…
## 3 0087469 "Indiana Jones and the Temple of Doom is the second of the Indy films…
## 4 0087469 "(re-Review): I've never disliked this movie, but it's also been a ha…
Let’s look at a specific review. Say review 13.
reviews_indianajones[13,]$review
## [1] "The adventurer and archaeologist Indiana Jones (Harrison Ford) with his bullwhip wielding and hat will fight against nasty enemies in India along with an oriental little boy (Jonathan Ke Quan) and a night club Singer (Kate Capshaw who married Steven Spielberg). Jones agrees with the village's inhabitants look for a lost magic stone. Meanwhile , they stumble with a secret thug cult ruled by an evil priest (Amrish Puri).The Indiana Jones adventures trilogy tries to continue the vintage pathes from the thirty years ago greatest classics , and the comics-books narrative , along with the special characteristics of the horror films of the 80s decade , as it is well reflected in the creepy and spooky scenes about human sacrifices . The picture is directed with great style and high adventure and driven along with enormous fair-play in the stunning mounted action set-pieces . Harrison Ford plays splendidly the valiant and brave archaeologist turned into an action man .Kate Capshaw interprets a scream girl who'll have a little romance with Indy . The movie blends adventures , noisy action , rip-snorting , humor , tongue-in-chek , it is a cinematic roller coaster ride and pretty bemusing . The motion picture has great loads of action , special effects galore and the usual and magnificent John Williams musical score . The glimmering and dazzling cinematography is efficiently realized by Douglas Slocombe . The pic was allrightly directed by Steven Spielberg. Film culminates in a spectacular finale that will have you on the edge of your seat . It's a must see for adventures aficionados , as perfect entertainment for all the family ."
How many numeric characters are contained in this review?
str_count(reviews_indianajones[13,]$review, "[0-9]")
## [1] 2
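Since `str_count()` accepts any regular expression, the same approach counts other character classes too. A quick stringr sketch on a made-up string:

```r
library(stringr)

x <- "2001: A Space Odyssey (1968)"
str_count(x, "[0-9]")        # count digits
## [1] 8
str_count(x, "[[:punct:]]")  # count punctuation characters
```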
Now let’s see if we can scrape all the reviews from the IMDb site for all 20 movies. We shall scrape the reviews that appear without clicking ‘Load More’.
reviews_all_movies <- data.frame()
# loop over all 20 movies
for(j in 1:20){
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews/?ref_=tt_ql_urv')
  movie_imdb <- read_html(link)
  # CSS selector found with SelectorGadget
  imdb_review <- movie_imdb %>% html_nodes('.show-more__control') %>% html_text()
  this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
  reviews_all_movies <- rbind.data.frame(reviews_all_movies, this_review)
}
reviews_all_movies <- as_tibble(reviews_all_movies)
reviews_all_movies %>% head(4)
## # A tibble: 4 × 2
## imbdId review
## <chr> <chr>
## 1 0062622 "Sometimes reading the user comments on IMDB fills me with despair fo…
## 2 0062622 ""
## 3 0062622 "\n "
## 4 0062622 "A stand-alone monument in cinema history, Stanley Kubrick's magnum o…
Let’s investigate this output. Some of the scraped “reviews” (like rows 2 and 3 above) are empty or whitespace-only, so let’s remove them:
# Initialise a df with clean reviews
cleaned_reviews <- data.frame()
# loop to clean reviews
for (i in 1:nrow(reviews_all_movies)){
  reviews_all_movies[i, "review"] <- reviews_all_movies[i, "review"] %>% str_trim()
  this_review <- reviews_all_movies[i,"review"]
  if(nchar(this_review) != 0){
    this_movie <- reviews_all_movies[i,1]
    this_review <- data.frame(imbdId = this_movie, review = this_review, stringsAsFactors = F)
    cleaned_reviews <- rbind.data.frame(cleaned_reviews, this_review)
  }
}
# see result
#cleaned_reviews %>% head(4)
# did we get reviews for all movies?
length(unique(cleaned_reviews$imbdId))
## [1] 20
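The row-by-row cleaning loop above works, but `str_trim()` and `filter()` are vectorised, so a dplyr pipeline can handle every row at once. A sketch on toy data with the same column names as our scraped tibble:

```r
library(dplyr)
library(stringr)

# toy rows standing in for the scraped reviews
toy_reviews <- tibble(
  imbdId = c("0062622", "0062622", "0062622"),
  review = c("A real review.", "", "\n        ")
)

# trim whitespace, then drop empty reviews
cleaned <- toy_reviews %>%
  mutate(review = str_trim(review)) %>%
  filter(nchar(review) > 0)

nrow(cleaned)
## [1] 1
```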
Having gotten, for all 20 movies in our data set, every review that appears without clicking ‘Load More’, we shall now briefly explore them, starting with the shortest review.
rev_chars <- reviews_all_movies
rev_chars$num_chars <- rep(NA, nrow(rev_chars))
for (i in 1:nrow(rev_chars)){
  num_chars <- rev_chars[i,2] %>% str_length()
  rev_chars$num_chars[i] <- num_chars
}
# look at result
rev_chars %>% head(4)
## # A tibble: 4 × 3
## imbdId review num_c…¹
## <chr> <chr> <int>
## 1 0062622 "Sometimes reading the user comments on IMDB fills me with de… 2409
## 2 0062622 "" 0
## 3 0062622 "" 0
## 4 0062622 "A stand-alone monument in cinema history, Stanley Kubrick's … 9555
## # … with abbreviated variable name ¹num_chars
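As an aside, the character counts can also be computed without a loop, since `str_length()` is vectorised. A sketch on toy data:

```r
library(dplyr)
library(stringr)

# str_length() is vectorised, so the per-row loop is not needed
toy <- tibble(review = c("abc", "", "hello world"))
out <- toy %>% mutate(num_chars = str_length(review))
out$num_chars
## [1]  3  0 11
```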
# shortest review
rev_chars %>% select(imbdId, num_chars) %>% arrange(num_chars) %>% filter(num_chars>0)
## # A tibble: 497 × 2
## imbdId num_chars
## <chr> <int>
## 1 0258463 70
## 2 0087469 87
## 3 0407887 116
## 4 1049413 117
## 5 0129387 127
## 6 0110148 133
## 7 0078788 150
## 8 0129387 157
## 9 0108160 171
## 10 0091042 181
## # … with 487 more rows
# what movie?
imdbId_to_use %>% filter(imdbId == "0062622" | imdbId == "0407887")
## # A tibble: 2 × 4
## movieId imdbId tmdbId title
## <int> <chr> <int> <chr>
## 1 924 0062622 62 2001: A Space Odyssey (1968)
## 2 48516 0407887 1422 Departed, The (2006)
rev_chars %>% select(everything()) %>% arrange(num_chars) %>% filter(num_chars>0) %>% head(5)
## # A tibble: 5 × 3
## imbdId review num_c…¹
## <chr> <chr> <int>
## 1 0258463 I like the bit where he punched the bloke really hard in the … 70
## 2 0087469 ...is annoying as hell, but otherwise this is a very entertai… 87
## 3 0407887 Amazing performances, action, script, production, and directi… 116
## 4 1049413 It will put you through every emotion that a human can experi… 117
## 5 0129387 This movie is one of the funniest movies ever I have watched … 127
## # … with abbreviated variable name ¹num_chars
Each movie’s review page on IMDb states how many reviews have been posted for that movie. We shall scrape this for all 20 movies.
num_reviews_all_movies <- data.frame()
# loop over all 20 movies
for(j in 1:20){
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews/?ref_=tt_ql_urv')
  movie_imdb <- read_html(link)
  # CSS selector found with SelectorGadget
  imdb_review <- movie_imdb %>% html_nodes('.header span:nth-child(1)') %>% html_text()
  this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
  num_reviews_all_movies <- rbind.data.frame(num_reviews_all_movies, this_review)
}
num_reviews_all_movies <- as_tibble(num_reviews_all_movies)
num_reviews_all_movies %>% head(4)
## # A tibble: 4 × 2
## imbdId review
## <chr> <chr>
## 1 0062622 2,466 Reviews
## 2 0078788 1,370 Reviews
## 3 0081505 2,148 Reviews
## 4 0087469 779 Reviews
Now let’s keep only the digits (the number of reviews) by removing the text.
for (i in 1:nrow(num_reviews_all_movies)){
  num_reviews_all_movies[i,2] <- str_replace_all(num_reviews_all_movies[i,2], '\\D', '')
}
num_reviews_all_movies$review <- as.numeric(num_reviews_all_movies$review)
names(num_reviews_all_movies)[2] <- "num_reviews"
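Because stringr functions are vectorised, the whole column can also be cleaned in a single call, with no loop. A sketch, using toy values standing in for the scraped text:

```r
library(stringr)

# remove every non-digit character, then convert to numeric
x <- c("2,466 Reviews", "779 Reviews")
as.numeric(str_remove_all(x, "\\D"))
## [1] 2466  779
```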
Now that we have our tibble with the IMDb movie id and a column indicating the number of reviews each movie has, we can sort it to see which has the most.
# which movie has highest num reviews
num_reviews_all_movies %>% arrange(desc(num_reviews)) %>% head(4)
## # A tibble: 4 × 2
## imbdId num_reviews
## <chr> <dbl>
## 1 0407887 2525
## 2 0062622 2466
## 3 0246578 2440
## 4 0081505 2148
# what movie?
imdbId_to_use %>% filter(imdbId == "0407887")
## # A tibble: 1 × 4
## movieId imdbId tmdbId title
## <int> <chr> <int> <chr>
## 1 48516 0407887 1422 Departed, The (2006)
Finally, some of the movie reviews on IMDb also provide ratings out of 10. We shall scrape these for all 20 movies.
ratings_all_movies <- data.frame()
# loop over all 20 movies
for(j in 1:20){
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt',this_movie,'/reviews/?ref_=tt_ql_urv')
  movie_imdb <- read_html(link)
  # CSS selector found with SelectorGadget
  imdb_review <- movie_imdb %>% html_nodes('.rating-other-user-rating span') %>% html_text()
  this_review <- data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F)
  ratings_all_movies <- rbind.data.frame(ratings_all_movies, this_review)
}
ratings_all_movies <- as_tibble(ratings_all_movies)
ratings_all_movies %>% head(4)
## # A tibble: 4 × 2
## imbdId review
## <chr> <chr>
## 1 0062622 10
## 2 0062622 /10
## 3 0062622 10
## 4 0062622 /10
This is not exactly what we wanted: each rating is followed by a spurious “/10” row. Let’s remove those unwanted rows by keeping every second row.
new_rats <- data.frame()
# keep every second row (the rating itself, not the "/10" suffix)
for (i in seq(from = 1, to = nrow(ratings_all_movies), by = 2)){
  new_rats <- rbind.data.frame(new_rats, ratings_all_movies[i,])
}
new_rats %>% head(4)
## # A tibble: 4 × 2
## imbdId review
## <chr> <chr>
## 1 0062622 10
## 2 0062622 10
## 3 0062622 6
## 4 0062622 7
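An alternative to taking every second row is to filter out the suffix rows directly: since only the “/10” rows contain a slash, `str_detect()` can drop them. A sketch on toy values standing in for the scraped tibble:

```r
library(dplyr)
library(stringr)

# scraped ratings alternate between the score and a "/10" suffix;
# keeping only rows without a slash drops the suffixes
toy <- tibble(imbdId = "0062622", review = c("10", "/10", "6", "/10"))
kept <- toy %>% filter(!str_detect(review, "/"))
kept$review
## [1] "10" "6"
```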
Lovely. Now we can investigate which movie has the highest average rating, and what that average rating is.
# change to numeric
new_rats$review <- as.numeric(new_rats$review)
# find mean and sort
new_rats %>% group_by(imbdId) %>% summarise(mean = mean(review, na.rm = TRUE), n = n()) %>% arrange(desc(mean)) %>% head(4)
## # A tibble: 4 × 3
## imbdId mean n
## <chr> <dbl> <int>
## 1 0120689 9.61 23
## 2 1049413 9.55 22
## 3 0078788 9.33 21
## 4 0258463 8.5 24
# find movie title
imdbId_to_use %>% filter(imdbId=="0120689")
## # A tibble: 1 × 4
## movieId imdbId tmdbId title
## <int> <chr> <int> <chr>
## 1 3147 0120689 497 Green Mile, The (1999)
Click here for a great walkthrough of using web scraping for tables.
Big websites, like Google or Amazon, are designed to handle high traffic. Smaller sites are not. It’s therefore important that you don’t overload a site with too many HTTP requests, which can slow it down, or even crash it completely. In fact, this is a technique often used by hackers. They flood sites with requests to bring them down, in what’s known as a ‘denial of service’ attack. Make sure you don’t carry one of these out by mistake! Don’t scrape too aggressively, either; include plenty of time intervals between requests, and avoid scraping a site during its peak hours.
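One simple courtesy is to pause between requests, for example with `Sys.sleep()`. A sketch of how the scraping loops above could be throttled (the two-second delay is an arbitrary choice; the URL and selector match those used earlier, and `imdbId_to_use` and rvest are assumed to be loaded):

```r
reviews_all_movies <- data.frame()
for(j in 1:20){
  this_movie <- imdbId_to_use$imdbId[j]
  link <- paste0('http://www.imdb.com/title/tt', this_movie, '/reviews/?ref_=tt_ql_urv')
  movie_imdb <- read_html(link)
  imdb_review <- movie_imdb %>% html_nodes('.show-more__control') %>% html_text()
  reviews_all_movies <- rbind.data.frame(reviews_all_movies,
    data.frame(imbdId = this_movie, review = imdb_review, stringsAsFactors = F))
  Sys.sleep(2)  # pause between requests so we don't overload the site
}
```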
Web scraping does bring with it some ethical concerns. It’s important to read about these and formulate your own opinion and approach, starting for example here, here, and here.
We can try to apply our learning on the following tasks.
The Freakonomics Radio Archive contains all previous Freakonomics podcasts. Scrape the titles, dates and descriptions, and download URLs of all the podcasts and store them in a dataframe (see if you can download all the medically-themed podcasts).
Decanter magazine provides one of the world’s best known wine ratings. Scrape the tasting notes, scores, and prices for their South African white wines (or whatever subset you choose).