Neural Network Basics

An introduction to Keras in R

Pavan Singh

2022-06-11

Overview

This tutorial provides a foundational introduction to neural network models in Keras. It gives a very brief background on the terminology and basics of what a neural network is, along with some key functions. For the most part, however, this tutorial focuses on the application of basic neural networks in four different examples: two regression problems and two classification problems. As mentioned, we shall mainly be using the Keras library in R.

References

A lot of the work follows from content in the Data Science for Industry course at the University of Cape Town. Other great resources used for this tutorial are listed below.

  • a great introduction to using Keras in R by DataCamp.

Prerequisites

The document is for the most part very applied in nature, and doesn’t assume much beyond familiarity with the R statistical computing environment. For programming purposes, it would be useful if you are familiar with the tidyverse, or at least dplyr specifically.

It must be stressed that this is only a starting point, a hopefully fun foray into the world of neural networks, not a definitive statement of how you should build them. In fact, some of the methods demonstrated would likely be too rudimentary for most goals.


A Crash Course on Neural Networks

We shall begin this discussion by looking at the specific terminology used in neural networks, as well as looking at the building blocks of neural networks, including neurons, weights, and activation functions. We look at how these building blocks are used in layers to create networks, and also how these networks are trained.

Multi-Layer Perceptrons

A perceptron is a single neuron model that was a precursor to larger neural networks.

The power of neural networks comes from their ability to learn the representation in your training data and how best to relate it to the output variable you want to predict. In this sense, neural networks learn a mapping.

Mathematically, they are capable of approximating any mapping function and have been proven to be universal approximators.

Neurons, Weights, and Activations

Neurons

The building blocks for neural networks are artificial neurons. These are simple computational units that have weighted input signals and produce an output signal using an activation function.

Weights

Weights on the inputs are very much like the coefficients used in a regression equation. Each neuron also has a bias, which can be thought of as an input that always has the value 1.0, and it, too, must be weighted. Weights are often initialized to small random values, such as values from 0 to 0.3, although more complex initialization schemes can be used.

The weighted inputs are summed and passed through an activation function, sometimes called a transfer function.

Activations

An activation function is a simple mapping of the summed weighted input to the output of the neuron. It is called an activation function because it governs the threshold at which the neuron is activated and the strength of the output signal. Historically, simple step activation functions were used: if the summed input was above a threshold (0.5, for example), the neuron would output a value of 1.0; otherwise, it would output 0.0.

Traditionally, non-linear activation functions are used. This allows the network to combine the inputs in more complex ways and, in turn, provide a richer capability in the functions they can model.
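
To make these ideas concrete, here is a small base-R sketch of the step function described above alongside two common non-linear activations (the sigmoid, and the ReLU used later in this tutorial):

```r
# Three activation functions applied to a summed weighted input z
step_activation <- function(z, threshold = 0.5) ifelse(z > threshold, 1, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))  # squashes output into (0, 1)
relu    <- function(z) pmax(0, z)         # zero for negative inputs

z <- 0.8            # an example summed weighted input
step_activation(z)  # 1
sigmoid(z)          # about 0.69
relu(z)             # 0.8
```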

Networks of Neurons

Neurons are arranged into networks of neurons. A row of neurons is called a layer, and one network can have multiple layers. The architecture of the neurons in the network is often called the network topology.

Input Layer

The input layer takes input from your data set and provides it to the network. Often a neural network is drawn with a visible layer containing one neuron per input value or column in your data set. These are not neurons as described above but simply pass the input value through to the next layer.

Hidden Layers

Layers after the input layer are called hidden layers because they are not directly exposed to the input. The simplest network structure is to have a single neuron in the hidden layer that directly outputs the value.

Given increases in computing power and efficient libraries, very deep neural networks can be constructed. Deep learning refers to having many hidden layers in your neural network.

Output Layer

The final layer is called the output layer, and it is responsible for outputting a value or vector of values that corresponds to the format required for the problem.

The choice of activation function in the output layer is strongly constrained by the type of problem that you are modeling. For example:

  • A regression problem may have a single output neuron, and the neuron may have no activation function (linear activation function).

  • A binary classification problem may have a single output neuron and use a sigmoid activation function to output a value between 0 and 1 to represent the probability of predicting a value for the class 1.

  • A multi-class classification problem may have multiple neurons in the output layer, one for each class (e.g., three neurons for the three classes in the famous iris flowers classification problem). In this case, a softmax activation function may be used to output a probability of the network predicting each of the class values.

Training Networks

Once configured, the neural network needs to be trained on your data set.

Data Pre-processing

You must first prepare your data for training on a neural network.

Data must be numerical, for example, real values. If you have categorical data, such as a sex attribute with the values “male” and “female,” you can convert it to a real-valued representation called one-hot encoding. This same one-hot encoding can be used on the output variable in classification problems with more than one class.
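
As a minimal base-R sketch of one-hot encoding (Keras also provides to_categorical, which is used later in this tutorial):

```r
# One-hot encode a categorical attribute: one indicator column per level
sex <- c("male", "female", "female", "male")
onehot <- model.matrix(~ sex - 1)  # drop the intercept to get one column per level
onehot
```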

Neural networks require the input to be scaled in a consistent way. You can rescale it to the range between 0 and 1, called normalization.

Optimisers

We need to decide on a training algorithm for the network. Common ones are RMSprop, Adam, and stochastic gradient descent.

Data is exposed to the network one row at a time as input. The network processes the input upward, activating neurons as it goes, to finally produce an output value. This is called a forward pass on the network. It is the same type of pass that is used after the network is trained in order to make predictions on new data.

The output of the network is compared to the expected output, and an error is calculated. This error is then propagated back through the network, one layer at a time, and the weights are updated according to the amount they contributed to the error - using backpropagation.

The process is repeated for all of the examples in your training data. One round of updating the network for the entire training data set is called an epoch. A network may be trained for tens, hundreds, or many thousands of epochs.
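
Conceptually, the relationship between batches, weight updates, and epochs can be sketched with a simple loop (illustrative only - this is not how Keras implements training):

```r
# Sketch of epochs and mini-batches: one weight update per batch
n <- 100          # training examples
batch_size <- 5   # examples per weight update
epochs <- 3       # passes over the full training set
updates <- 0
for (epoch in seq_len(epochs)) {
  for (batch_start in seq(1, n, by = batch_size)) {
    # forward pass, error calculation, and backpropagation would happen here
    updates <- updates + 1
  }
}
updates  # 60 updates: 20 batches per epoch x 3 epochs
```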

Weight Updates

The weights in the network are updated from the errors calculated: \(w = w - \alpha \Delta E\)

The amount that weights are updated is controlled by a configuration parameter called the learning rate (\(\alpha\)). It is also called the step size and controls the step or change made to a network weight for a given error. Often small learning rates are used, such as 0.1 or 0.01 or smaller.
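
As a tiny numeric illustration of the update rule above (the gradient value here is made up):

```r
# One weight update: w_new = w - alpha * delta_E (illustrative values)
w <- 0.5         # current weight
alpha <- 0.1     # learning rate (step size)
delta_E <- 0.8   # hypothetical error gradient from backpropagation
w <- w - alpha * delta_E
w  # 0.42
```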

Prediction

Once a neural network has been trained, it can be used to make predictions.

You make predictions on testing data in order to estimate the skill of the model on unseen data. Predictions are made by providing the input to the network and performing a forward-pass, allowing it to generate an output you can use as a prediction.

Setup

First load the required packages for this notebook. Namely, the Keras library - this is a high-level neural networks API. It was developed with a focus on enabling fast experimentation.

library(tidyverse)
library(keras)
library(tensorflow)

Example 1: Boston Housing Regression

Loading Data

Let’s begin by loading the data set for our problem.

boston_housing <- dataset_boston_housing()

Train and Test Sets

Next, we shall obtain our features and labels from the data.

c(train_data, train_labels) %<-% boston_housing$train
c(test_data, test_labels) %<-% boston_housing$test

Now we have our training and testing data. Let’s see their dimensions.

paste("Shape of Training set:",dim(train_data))
## [1] "Shape of Training set: 404" "Shape of Training set: 13"
paste("Shape of Testing set:",dim(test_data))
## [1] "Shape of Testing set: 102" "Shape of Testing set: 13"

Next, it is always advised to explore your data. Let’s see a summary of our features. This helps us in deciding whether scaling is required.

summary(train_data)
##        V1                 V2               V3              V4         
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08144   1st Qu.:  0.00   1st Qu.: 5.13   1st Qu.:0.00000  
##  Median : 0.26888   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.74511   Mean   : 11.48   Mean   :11.10   Mean   :0.06188  
##  3rd Qu.: 3.67481   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##        V5               V6              V7               V8        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4530   1st Qu.:5.875   1st Qu.: 45.48   1st Qu.: 2.077  
##  Median :0.5380   Median :6.199   Median : 78.50   Median : 3.142  
##  Mean   :0.5574   Mean   :6.267   Mean   : 69.01   Mean   : 3.740  
##  3rd Qu.:0.6310   3rd Qu.:6.609   3rd Qu.: 94.10   3rd Qu.: 5.118  
##  Max.   :0.8710   Max.   :8.725   Max.   :100.00   Max.   :10.710  
##        V9              V10             V11             V12        
##  Min.   : 1.000   Min.   :188.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.23   1st Qu.:374.67  
##  Median : 5.000   Median :330.0   Median :19.10   Median :391.25  
##  Mean   : 9.441   Mean   :405.9   Mean   :18.48   Mean   :354.78  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.16  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##       V13       
##  Min.   : 1.73  
##  1st Qu.: 6.89  
##  Median :11.39  
##  Mean   :12.74  
##  3rd Qu.:17.09  
##  Max.   :37.97

Scaling Data

Neural networks work a lot better with data on a small, consistent scale, so we scale our data. It’s often a good idea to scale the data so all variables lie on nearly the same scale. If variables have very different ranges, a one-unit change in one variable might represent a huge change (say on a probability scale), while the same one-unit change might represent a tiny change (say if units are metres and distances are very large). One way to scale variables is to subtract each variable’s mean and divide by its standard deviation.

Note that means and standard deviations should always come from the training set, even when scaling the validation and test sets. Otherwise we are using information from the test set in our model building, which we shouldn’t do. Most of the time, if observations have been randomly allocated to training and test sets, it won’t make much difference (because the variable means and standard deviations will be similar in the training and test sets), but we should do the right thing.

The scale function stores means and standard deviations as attributes of the scaled object, so we can extract these and use them to scale variables in the validation and test data sets. We do this below and view an updated summary.

train_data <- scale(train_data)
apply(train_data, 2, mean) # mean should be 0
apply(train_data, 2, sd) # sd should be 1
attributes(train_data) # previous means and sds used to scale stored here
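
For example, the stored attributes can be passed to scale to standardise the test set with the training-set statistics (a sketch using the tutorial's variable names; not run here):

```r
# Scale the test set using the training set's means and standard deviations
train_means <- attr(train_data, "scaled:center")
train_sds   <- attr(train_data, "scaled:scale")
test_data_scaled <- scale(test_data, center = train_means, scale = train_sds)
```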

Note that we don’t have to scale the targets, only the input features.

summary(train_data)
##        V1                  V2                 V3                V4         
##  Min.   :-0.404599   Min.   :-0.48302   Min.   :-1.5628   Min.   :-0.2565  
##  1st Qu.:-0.396470   1st Qu.:-0.48302   1st Qu.:-0.8771   1st Qu.:-0.2565  
##  Median :-0.376186   Median :-0.48302   Median :-0.2077   Median :-0.2565  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.007608   3rd Qu.: 0.04291   3rd Qu.: 1.0271   3rd Qu.:-0.2565  
##  Max.   : 9.223411   Max.   : 3.72437   Max.   : 2.4423   Max.   : 3.8888  
##        V5                V6                 V7                V8         
##  Min.   :-1.4694   Min.   :-3.81252   Min.   :-2.3661   Min.   :-1.2859  
##  1st Qu.:-0.8897   1st Qu.:-0.55275   1st Qu.:-0.8423   1st Qu.:-0.8192  
##  Median :-0.1650   Median :-0.09662   Median : 0.3396   Median :-0.2945  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6279   3rd Qu.: 0.48172   3rd Qu.: 0.8980   3rd Qu.: 0.6786  
##  Max.   : 2.6740   Max.   : 3.46289   Max.   : 1.1091   Max.   : 3.4331  
##        V9               V10               V11               V12         
##  Min.   :-0.9704   Min.   :-1.3097   Min.   :-2.6704   Min.   :-3.7664  
##  1st Qu.:-0.6255   1st Qu.:-0.7627   1st Qu.:-0.5685   1st Qu.: 0.2113  
##  Median :-0.5105   Median :-0.4562   Median : 0.2836   Median : 0.3875  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6738   3rd Qu.: 1.5633   3rd Qu.: 0.7835   3rd Qu.: 0.4396  
##  Max.   : 1.6738   Max.   : 1.8338   Max.   : 1.6015   Max.   : 0.4475  
##       V13         
##  Min.   :-1.5178  
##  1st Qu.:-0.8065  
##  Median :-0.1855  
##  Mean   : 0.0000  
##  3rd Qu.: 0.5999  
##  Max.   : 3.4777

Model Building and Training

Let’s define our model. The first thing to do is call keras_model_sequential. This allows us to build models by stacking layers on top of each other.

model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu",input_shape = dim(train_data)[2]) %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 1)

The first layer here is a fully connected layer. It has 64 units, a ReLU activation function, and an input shape equal to the number of columns (features) in the training data. Since we have 13 features, we have 13 inputs. For the subsequent layer (layer 2) we don’t need to define the input shape; we specify ReLU activation and 64 units as well. In the final layer, we specify 1 unit and make no specification for the activation function; the default is the linear activation function.

Note that in the first layer it is typically not advised to use fewer than 13 units (the number of inputs). Also, with respect to the number of units, people tend to use powers of 2.

Next we can compile the model. Here we need to tell the model what loss function to use, what optimizer to use, and what metric we want. We decided to use mean squared error: \(\frac{1}{n}\sum(\hat{y}-y)^2\). With respect to optimisers, popular ones include stochastic gradient descent, Adam, and RMSprop. What is nice about optimisers like RMSprop and Adam is that we do not have to specify the learning rate - they will use a default value and then try to find an optimal value. The metric is just for us; the model doesn’t use it at all in terms of training and fitting. Mean absolute error is given as \(\frac{1}{n}\sum|\hat{y}-y|\).

model %>% compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = list("mean_absolute_error")
  )

Now we are set for training. We call the model (model in this case) and the fit function. We need to specify the training data, the training labels, and the epochs (how many times the network will see the data). We also specify the batch size, which tells the network that it is going to read 5 examples at a time to compute the weight updates. The weight updates are specified as \(w = w-\alpha\Delta E\). That is, the new weight is the old weight minus some learning rate times the change in error (the partial derivative computed through backpropagation). We also set shuffle to TRUE, since sometimes there may be some intrinsic order in the data that we don’t want.

history <- model %>% fit(
  train_data, train_labels, 
  epochs = 50, batch_size = 5, 
  validation_split = 0.2, shuffle = TRUE
)

Training Performance

When calling the fit function, Keras provides feedback on what happens to the loss during training. This is useful in determining, for example, whether the model is over-fitting. We plot the training performance below.

plot(history)

Initially, the training and validation losses are quite high; as the epochs increase, these values decrease. We see that the training loss is below the validation loss and seems to stay that way, which can be an indication of overfitting.

Predictions

Having done this, we can proceed with making predictions on the testing data - unseen data. Since we trained the model on scaled data, we will have to scale our testing data as well (strictly speaking, using the means and standard deviations from the training set).

test_predictions <- model %>% predict(scale(test_data))

Let’s take a peek at the predictions and correct values.

round(test_predictions[ , 1][0:15],2)
##  [1]  6.46 19.99 22.43 26.51 25.41 22.69 28.37 21.96 20.02 19.14 19.89 17.39
## [13] 15.01 42.61 16.24
test_labels[0:15]
##  [1]  7.2 18.8 19.0 27.0 22.2 24.5 31.2 22.9 20.5 23.2 18.6 14.5 17.8 50.0 20.8

Now we introduce something new: a callback that lets us know when an epoch has completed. In this case, the epoch number is printed on each even-numbered epoch. This is essentially just to show you the progress in training.

print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 2 == 0) cat(epoch, '\n')
  }
)   

We now retrain the model for 100 epochs, using our callback to print progress.

history <- model %>% fit(
  train_data,
  train_labels,
  epochs = 100,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(print_dot_callback))
## 0 
## 2 
## 4 
## 6 
## 8 
## 10 
## 12 
## 14 
## 16 
## 18 
## 20 
## 22 
## 24 
## 26 
## 28 
## 30 
## 32 
## 34 
## 36 
## 38 
## 40 
## 42 
## 44 
## 46 
## 48 
## 50 
## 52 
## 54 
## 56 
## 58 
## 60 
## 62 
## 64 
## 66 
## 68 
## 70 
## 72 
## 74 
## 76 
## 78 
## 80 
## 82 
## 84 
## 86 
## 88 
## 90 
## 92 
## 94 
## 96 
## 98

Let’s plot the training performance.

plot(history)

Evaluations

Finally we shall evaluate our model. This gives us the loss and mean absolute error - the metrics we specified we wanted. The evaluate function essentially does the prediction for us.

model %>% evaluate(scale(test_data), test_labels, verbose = 0)
##                loss mean_absolute_error 
##           16.864580            2.803717

Example 2: Iris Classification

Load Data

iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

Let’s return the first part of iris.

head(iris)
##    V1  V2  V3  V4          V5
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa

We have four input features. And we have the class shown in the far right column. We have three types of classes and 150 examples. This is shown below in the structure.

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ V1: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ V2: num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ V3: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ V4: num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ V5: chr  "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...

Next up, dimensions. It’s key to look at the data before we do any pre-processing and modelling. We assign variable names to the features to make the data easier to interpret. Finally, we show the cleaned-up data frame.

dim(iris)
## [1] 150   5
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
iris <- as.data.frame(iris)
iris %>% head(5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
## 1          5.1         3.5          1.4         0.2 Iris-setosa
## 2          4.9         3.0          1.4         0.2 Iris-setosa
## 3          4.7         3.2          1.3         0.2 Iris-setosa
## 4          4.6         3.1          1.5         0.2 Iris-setosa
## 5          5.0         3.6          1.4         0.2 Iris-setosa

Visualise Data

Let’s view the data by plotting it out.

plot(iris$Petal.Length, 
     iris$Petal.Width, 
     pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], 
     xlab="Petal Length", 
     ylab="Petal Width")

We see there is some overlap between the classes.

We need to convert the target values into something more suitable. Currently they are strings; they need to be numbers. We subtract 1 because we want the classes to start from 0.

numerical_target <- factor(iris[,5]) 
iris[,5] <- as.numeric(numerical_target) -1

We shall turn iris into a matrix and check the dimensions.

iris <- as.matrix(iris)
dim(iris)
## [1] 150   5

Scaling Data

Standardise the iris features using the scale function.

iris_features <- scale(iris[,1:4])
iris_target <- iris[,5]

Now let’s return the summary of iris features.

summary(iris_features)
##   Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
##  Min.   :-1.86378   Min.   :-2.4308   Min.   :-1.5635   Min.   :-1.4396  
##  1st Qu.:-0.89767   1st Qu.:-0.5858   1st Qu.:-1.2234   1st Qu.:-1.1776  
##  Median :-0.05233   Median :-0.1245   Median : 0.3351   Median : 0.1328  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5674   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.1043   Max.   : 1.7804   Max.   : 1.7052

We see the data ranges are a lot more consistent. Next, we return the summary of the iris target. This just confirms that the values are 0, 1, and 2.

summary(iris_target)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       1       1       2       2

Train and Test Data

We split our data into training and testing sets using a 67/33 split, so our training set has 67% of the observations and the testing set has the remaining 33%.

# Determine sample size
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))

# Split the `iris` data
x_train <- iris_features[ind==1, 1:4]
x_test <- iris_features[ind==2, 1:4]

# Split the class attribute
y_train <- iris_target[ind==1]
y_test <- iris_target[ind==2]

One-Hot Encoding

Since we want our network output to give probabilities, we need to one-hot encode our original target. As such, here we shall convert the targets/labels to their one-hot encoded equivalents.

y_train <- to_categorical(y_train)
y_test_original = y_test
y_test <- to_categorical(y_test)

Model Building

We check the dimensions of the targets. This helps when building and specifying the model.

dim(y_train)
## [1] 112   3
dim(y_test)
## [1] 38  3

Let’s build that model. We use the sequential model, essentially just stacking layers. We are adding fully connected layers, aka dense layers. Note that it’s always good to keep checking the API on the Keras website for more details about the default values and arguments of these layers. We note that the default activation function for the dense layer is the linear activation function.

Again, in the first layer we have to specify the input shape of the data. The dropout layer, layer_dropout, randomly sets input units to 0 with a frequency of rate at each step during training, which helps prevent overfitting. The rate is a float between 0 and 1 and represents the fraction of the input units to drop.

model <- keras_model_sequential() 
model %>% 
    layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>% 
    layer_dropout(rate = 0.5) %>%
    layer_dense(units = 3, activation = 'softmax')

Note that a high dropout rate means we are probably cutting out a lot of weights. We need to specify 3 units in the output layer, since we have 3 classes. Since it’s a classification problem we choose softmax - it makes sense to use this over sigmoid, since this is not a binary case but a multi-class classification problem.

We can print out a summary of the network architecture. This essentially gives us a high-level overview and the output shape of each layer. Param # gives us the number of parameters trained for each layer - \(weights + biases\). The total number of parameters is shown at the end, which is equal to the sum of trainable and non-trainable parameters.

summary(model)
## Model: "sequential_1"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  dense_4 (Dense)                    (None, 8)                       40          
##  dropout (Dropout)                  (None, 8)                       0           
##  dense_3 (Dense)                    (None, 3)                       27          
## ================================================================================
## Total params: 67
## Trainable params: 67
## Non-trainable params: 0
## ________________________________________________________________________________
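
We can check these parameter counts by hand: a dense layer has (inputs × units) weights plus one bias per unit.

```r
# Parameter counts for the model above: weights + biases per dense layer
layer1 <- 4 * 8 + 8  # 4 inputs, 8 units -> 40 parameters
layer2 <- 8 * 3 + 3  # 8 inputs (dropout does not change the shape), 3 units -> 27
layer1 + layer2      # 67, matching the model summary
```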

We shall now compile our model. We need to provide extra information to train the model: the loss function, the optimiser, and what metric to display to the user. We are in a classification problem, so categorical cross-entropy is appropriate. We choose the Adam optimiser here and specify a learning rate of 0.01.

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(learning_rate = 0.01),
  metrics = c('accuracy')
)

Let’s get to training the neural network. We call the fit function. Again we specify the epochs, validation split, and batch size. We also need to tell the network which is the training data and which are the labels. Remember that the batch size is the amount of data loaded into memory for each round of updates. So here we will load 5 examples, then another 5, then another 5, until we have seen all the data - this represents one epoch.

history <- model %>% fit(
  x_train, y_train, 
  epochs = 300, batch_size = 5, 
  validation_split = 0.2, shuffle = TRUE, verbose = 0
)

Training Performance

When calling the fit function, Keras provides feedback on what happens to the loss during training. This is useful in determining, for example, whether the model is over-fitting.

plot(history)

Evaluate the Performance

model %>% evaluate(x_test, y_test)
##       loss   accuracy 
## 0.08418372 0.94736844

We get a good, high accuracy of about 95%.

We can also get the confusion matrix, since it is a classification problem.

# update to video: predict_classes deprecated from tensorflow >= v2.6
Y_test_hat <- model %>% predict(x_test) %>% k_argmax() %>% as.numeric()
table(y_test_original, Y_test_hat)
##                Y_test_hat
## y_test_original  0  1  2
##               0 12  0  0
##               1  0 16  0
##               2  0  2  8

A confusion matrix is a table used to visualize and summarize the performance of a classification algorithm.
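
As a quick sanity check, overall accuracy is the sum of the diagonal of the confusion matrix divided by the total number of test observations. For the table above:

```r
# Accuracy from the confusion matrix printed above
cm <- matrix(c(12, 0, 0,
               0, 16, 0,
               0, 2, 8), nrow = 3, byrow = TRUE)
sum(diag(cm)) / sum(cm)  # about 0.947, matching evaluate()
```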


Example 3: Car Sales Regression

Load Data

We start by reading in our data as a csv file. We also quickly view the first 10 rows of the data.

sales <- read.csv(url("https://raw.githubusercontent.com/MGCodesandStats/datasets/master/cars.csv"), header = TRUE)
head(sales, 10)
##    age gender miles  debt income sales
## 1   28      0    23     0   4099   620
## 2   26      0    27     0   2677  1792
## 3   30      1    58 41576   6215 27754
## 4   26      1    25 43172   7626 28256
## 5   20      1    17  6979   8071  4438
## 6   58      1    18     0   1262  2102
## 7   44      1    17   418   7017  8520
## 8   39      1    28     0   3282   500
## 9   44      0    24 48724   9980 22997
## 10  46      1    46 57827   8163 26517

Note that sales is the target variable. So in this problem we are trying to predict car sales using the other 5 features.

str(sales)
## 'data.frame':    963 obs. of  6 variables:
##  $ age   : int  28 26 30 26 20 58 44 39 44 46 ...
##  $ gender: int  0 0 1 1 1 1 1 1 0 1 ...
##  $ miles : int  23 27 58 25 17 18 17 28 24 46 ...
##  $ debt  : int  0 0 41576 43172 6979 0 418 0 48724 57827 ...
##  $ income: int  4099 2677 6215 7626 8071 1262 7017 3282 9980 8163 ...
##  $ sales : int  620 1792 27754 28256 4438 2102 8520 500 22997 26517 ...

Given the small size of the data, we know we shouldn’t create a very complex network model. Let’s check a summary of the data.

summary(sales)
##       age            gender          miles           debt           income     
##  Min.   :19.00   Min.   :0.000   Min.   :10.0   Min.   :    0   Min.   :    0  
##  1st Qu.:27.00   1st Qu.:0.000   1st Qu.:20.0   1st Qu.: 1475   1st Qu.: 3506  
##  Median :37.00   Median :1.000   Median :25.0   Median : 6236   Median : 6360  
##  Mean   :37.97   Mean   :0.513   Mean   :27.7   Mean   :14109   Mean   : 6176  
##  3rd Qu.:49.00   3rd Qu.:1.000   3rd Qu.:32.0   3rd Qu.:16686   3rd Qu.: 8650  
##  Max.   :60.00   Max.   :1.000   Max.   :97.0   Max.   :59770   Max.   :11970  
##      sales      
##  Min.   :  500  
##  1st Qu.: 3554  
##  Median : 9130  
##  Mean   :11690  
##  3rd Qu.:19245  
##  Max.   :29926

We have some very small and some very large numbers, so it is suggested to scale our features.

Scale Data

In feature based problems, we often need to do scaling.

# Max-Min Normalization
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
maxmindf <- as.data.frame(lapply(sales, normalize))
attach(maxmindf)
head(maxmindf, 10)
##           age gender      miles        debt    income       sales
## 1  0.21951220      0 0.14942529 0.000000000 0.3424394 0.004078026
## 2  0.17073171      0 0.19540230 0.000000000 0.2236424 0.043906749
## 3  0.26829268      1 0.55172414 0.695599799 0.5192147 0.926187725
## 4  0.17073171      1 0.17241379 0.722302158 0.6370927 0.943247468
## 5  0.02439024      1 0.08045977 0.116764263 0.6742690 0.133827228
## 6  0.95121951      1 0.09195402 0.000000000 0.1054302 0.054441650
## 7  0.60975610      1 0.08045977 0.006993475 0.5862155 0.272548087
## 8  0.48780488      1 0.20689655 0.000000000 0.2741855 0.000000000
## 9  0.60975610      0 0.16091954 0.815191568 0.8337510 0.764527968
## 10 0.65853659      1 0.41379310 0.967492053 0.6819549 0.884150071

After scaling we see the values are in similar ranges. The neural network will probably do a better job with such data.

View and Cleaning

We shall remove the header from the data frame and convert it into a matrix. This is general pre-processing.

names(maxmindf) <- NULL
sales = data.matrix(maxmindf)

It’s always good to get to know the shape of the data.

dim(sales)
## [1] 963   6

Features, Labels, Train and Test Sets

Now we shall split the data into features and targets.

sales_features <- sales[,1:5]
sales_target <- sales[,6]

Now we can generate our training and testing data sets. We use a 70/30 split.

# Determine sample size
ind <- sample(2, nrow(sales), replace=TRUE, prob=c(0.70, 0.30))

# Split the data
x_train <- sales_features[ind==1, 1:4]
x_test <- sales_features[ind==2, 1:4]

# Split the class attribute
y_train <- sales_target[ind==1]
y_test <- sales_target[ind==2]

Note, it’s common practice to have four variables:

  • x_train and y_train

  • x_test and y_test

Model Building

Let’s conduct some model building, where we can generate our layers and specify inputs, units etc.

dim(x_train)[2]
## [1] 4

Now, using our knowledge of the input shape, we can specify this in the input layer. We should always start with a simple model, as shown below.

simple_model <- keras_model_sequential() %>%
    layer_dense(units = 4, activation = "relu",input_shape = dim(x_train)[2]) %>%
    layer_dense(units = 1)

Then we can make more complex models - shown below is an example of this.

model <- keras_model_sequential() %>%
    layer_dense(units = 16, activation = "relu",input_shape = dim(x_train)[2]) %>%
    layer_dense(units = 8, activation = "relu") %>%
    layer_dense(units = 4,activation = "relu") %>%
    layer_dense(units = 1)

For the first layer, we have 16 units. We have 4 inputs, and typically we shouldn't have fewer units than inputs. We have two hidden layers, with 8 and 4 units respectively. Both of these layers use a ReLU activation function. The output layer uses a linear activation function (the default).

Now we can compile our model.

simple_model %>% compile(
    loss = "mse",
    optimizer = 'adam',
    metrics = list("mean_absolute_error")
  )

We set the epochs to 220 initially. Our validation split is smaller than usual, only 10% of the training set, since our data set is relatively small.

history <- simple_model %>% fit(
  x_train, y_train, 
  epochs = 220, batch_size = 50, 
  validation_split = 0.1, shuffle = TRUE
)

Training Performance

plot(history)

At this stage we usually go back and tweak our model, compile it and train again. This process is repeated until satisfactory results are achieved - if possible.

Evaluations

simple_model %>% evaluate(x_test, y_test)
##                loss mean_absolute_error 
##          0.01918746          0.11168385

Example 4: MNIST Classification

MNIST is a data set of handwritten digits. For this example, instead of treating it as an image-based classification problem, we shall treat it as a feature-based classification problem. The data set is also available in the Keras library.

In this example we follow steps very similar to those in the previous examples. As such, we do not provide many annotations or comments, unless deemed necessary.

Load dataset

Get data and features, as well as targets.

mnist <- dataset_mnist()

The data set contains both the training and testing data.

str(mnist)
## List of 2
##  $ train:List of 2
##   ..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
##  $ test :List of 2
##   ..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...

We shall then set these corresponding data into our four different variables. So then we end up with x_train, x_test, y_train and y_test.

x_train <- mnist$train$x
x_test <- mnist$test$x
y_train <- mnist$train$y
y_test <- mnist$test$y

View Data

paste("Shape of Training Data:", dim(x_train))
## [1] "Shape of Training Data: 60000" "Shape of Training Data: 28"   
## [3] "Shape of Training Data: 28"
paste("Shape of Testing Data:", dim(x_test))
## [1] "Shape of Testing Data: 10000" "Shape of Testing Data: 28"   
## [3] "Shape of Testing Data: 28"
head(x_test)
head(x_train)

We will likely have to do one-hot encoding.

Reshape the Data

We reshape the values to convert the image into a vector.

Each image is originally \(28\times28\times1\) (the last \(\times1\) is due to the fact that the image is grey scale). So each \(28\times28\) image can be converted into a vector of length 784 (\(28\times28 = 784\)).

So the original training data was (60 000, 28, 28). We reshape it to (60 000, 784).

c(nrow(x_test))
## [1] 10000
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))

Just to confirm.

paste("Shape of Training Data:", dim(x_train))
## [1] "Shape of Training Data: 60000" "Shape of Training Data: 784"
paste("Shape of Testing Data:", dim(x_test))
## [1] "Shape of Testing Data: 10000" "Shape of Testing Data: 784"

Rescale Data

dim(x_train)
## [1] 60000   784
dim(x_test)
## [1] 10000   784
x_train <- x_train / 255
x_test <- x_test / 255

One-Hot Encoding

y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)
dim(y_train)
## [1] 60000    10

Create Model

model <- keras_model_sequential()

We can add a number of layers and activation functions to the model. We start off with model %>% and then append each new layer (with its activation function) on a new line.

The first layer has one extra thing which the others do not have. For the first layer we specify the input shape, which denotes the shape of the data input. In this case the shape of the input is just a vector of length 784, and thus we add input_shape = c(784).

First we start off with a simple model with two layers then we will add more complexity to the model.

In each case the line model <- keras_model_sequential() is re-run; otherwise we would keep adding layers to the first instance of the variable model and create one massive model.

model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dense(units = 10, activation = 'softmax')

Next, we add an extra layer so the model can learn more complexities.

model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'softmax')

Next, we add dropout to the first hidden layer.

model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'softmax')

Finally, we add dropout to the next hidden layer.

model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')

Print out a summary of the network architecture.

summary(model)
## Model: "sequential_8"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  dense_21 (Dense)                   (None, 256)                     200960      
##  dropout_3 (Dropout)                (None, 256)                     0           
##  dense_20 (Dense)                   (None, 128)                     32896       
##  dropout_2 (Dropout)                (None, 128)                     0           
##  dense_19 (Dense)                   (None, 10)                      1290        
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________

In total, we are fitting roughly 235 000 weights on our training examples.

Compile & Train Model

We need to provide extra information to train the model. We need to specify the loss function, the optimiser and what metric to display to the user.

We use categorical cross entropy: \(-\sum_i y_i \ln(\hat{y}_i)\)
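As a quick numeric illustration of this loss for a single example (the label and probability values below are assumed for illustration):

```r
# Categorical cross-entropy for one example:
# true label is class 3 (one-hot encoded), predicted probabilities sum to 1
y_true <- c(0, 0, 1, 0)
y_hat  <- c(0.1, 0.2, 0.6, 0.1)

# Only the term for the true class survives the sum
loss <- -sum(y_true * log(y_hat))
loss
## [1] 0.5108256
```

The loss is simply \(-\ln(0.6)\): the more probability mass the model places on the correct class, the closer the loss gets to zero.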

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)
history <- model %>% fit(
  x_train, y_train, 
  epochs = 10, batch_size = 128, 
  validation_split = 0.2
)

Training Performance

When calling the fit function, Keras provides feedback on what happens to the loss during training. This is useful for determining whether the model was over-fitting, for example.

plot(history)

Evaluations

Evaluate the performance on the test data.

model %>% evaluate(x_test, y_test)
##       loss   accuracy 
## 0.07894263 0.97810000

Prediction

To predict we can either predict on an entire matrix or on a subset. In the case of a subset, you need to make sure that the correct dimensions are used as the network has certain input expectations. In this case, the model expects data in this format: [batches, 784]. So you can send any number of batches of data to the network.

dim(x_test)
## [1] 10000   784

Here we want to predict on the first 10 test examples. But just using x_test[0:10] would give the wrong dimensions, so we need to reshape the data.

subset <- array_reshape(array(x_test[0:10,]), c(10, 784))

Here we check the dimensions.

dim(subset)
## [1]  10 784

Finally, we can predict on the 10 first examples.

# update to video: predict_classes deprecated from tensorflow >= v2.6
model %>% predict(subset) %>% k_argmax() %>% as.numeric() 
##  [1] 8 7 2 6 5 5 7 2 6 6

Here we predict on all of the x_test data.

model %>% predict(x_test) %>% k_argmax() %>% as.numeric()%>%head(10)
##  [1] 7 2 1 0 4 1 4 9 6 9

Summary

This was a short introduction to deep learning in R with Keras. We looked at four different examples: two regression problems and two classification problems. We covered the basics of exploring and preprocessing the data. After this, we looked at constructing a deep learning model; in these cases, we built Multi-Layer Perceptrons (MLPs) for multi-class classification and for regression problems (predicting a numerical output). We then looked at how to compile and fit the model to the data, and how to visualize the training history. Finally, we evaluated the model by predicting target values on test data.

Data Loading

Data Pre-processing

Before you can build your model, you need to make sure that your data is cleaned, normalized (if applicable), and divided into training and test sets. At first sight, when inspecting the data with head(), nothing may look out of the ordinary, so it helps to also use summary() and str(). We want to work with features whose values fall in small ranges, preferably between 0 and 1.

Data Splits

After we have checked the quality of the data and decided whether or not it needs scaling, we split it into training and test sets so that we are ready to start building the model. Doing this ensures that we can make honest assessments of the performance of our model afterwards.
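The split pattern used throughout this tutorial can be sketched in base R on a toy matrix (the data here is randomly generated for illustration only):

```r
set.seed(42)  # for reproducibility

n <- 100
data <- matrix(rnorm(n * 3), nrow = n)  # toy feature matrix

# Randomly assign each row to group 1 (train) or 2 (test), 70/30
ind <- sample(2, n, replace = TRUE, prob = c(0.70, 0.30))

x_train <- data[ind == 1, ]
x_test  <- data[ind == 2, ]

nrow(x_train) + nrow(x_test)  # every row lands in exactly one set: 100
```

Note that because the assignment is random, the realized split is only approximately 70/30; for an exact split you could instead sample a fixed number of row indices.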

One-Hot Encoding (Classification Problems)

When modelling multi-class classification problems with neural networks, it is generally good practice to transform the target attribute from a vector of class values into a matrix with a Boolean indicator for each class, marking whether a given instance belongs to that class. The to_categorical() function in Keras performs this one-hot encoding.
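Conceptually, to_categorical() just builds an indicator matrix. A base-R equivalent for integer labels 0 to 9 can be sketched as follows (the example labels are the first few MNIST training labels shown earlier):

```r
# One-hot encode integer labels 0..9 into a 10-column indicator matrix
labels <- c(5, 0, 4, 1, 9)

# diag(10) is the 10x10 identity; +1 because R is 1-indexed
one_hot <- diag(10)[labels + 1, ]

dim(one_hot)
## [1]  5 10
```

Each row contains a single 1 in the column corresponding to its class, matching the (60000, 10) shape produced by to_categorical() above.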

Construct, Compile and Train Model

To start constructing a model, first initialize a sequential model with the keras_model_sequential() function. Then you're ready to start modelling: we can add layers, specify the units in each layer, and choose activation functions.

Once the architecture of the model is set up, it's time to compile and fit the model to the data. To compile the model, you configure it with a certain optimiser. We looked at Adam and RMSprop; other popular optimization algorithms include Stochastic Gradient Descent (SGD). Depending on which algorithm you choose, you'll need to tune certain parameters, such as the learning rate or momentum. We must also specify the loss function, and the choice depends on the task at hand: for regression we used mse and for classification we used categorical_crossentropy. Additionally, we provide a metric to assess the model during training: accuracy is appropriate for classification problems, and mean_absolute_error is good for regression problems.

Assessing Training Performance

Using the plot function, we can visualise the performance of the training process for the specified network. If training accuracy keeps improving while validation accuracy gets worse, the model is probably overfitting: it starts to memorize the data instead of learning from it. If accuracy on both data sets is still rising over the last few epochs, the model has not yet over-learned the training data.

Fine Tuning Model

Fine-tuning your model is probably something you'll do a lot, especially in the beginning. Two key decisions you'll probably want to adjust are how many layers to use and how many hidden units to choose for each layer. Besides playing with the number of epochs or the batch size, you can also tweak the model by adding layers, increasing the number of hidden units, and passing your own optimization parameters to the compile() function.

Predictions

Now that your model is created, compiled and has been fitted to the data, it’s time to actually use your model to predict the labels or target values for your test set. As you might have expected, you can use the predict() function to do this.

For classification problems, you can print out the confusion matrix to check out the predictions and the real labels.
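For example, given predicted classes (such as those returned by k_argmax() above) and the original integer labels, table() produces the confusion matrix directly. The vectors below are toy stand-ins for the model's predictions and y_test:

```r
# Toy predicted and true labels (stand-ins for model output and y_test)
predicted <- c(7, 2, 1, 0, 4, 1, 4, 9, 6, 9)
actual    <- c(7, 2, 1, 0, 4, 1, 4, 9, 5, 9)

# Rows = predicted class, columns = true class;
# off-diagonal counts are misclassifications
table(Predicted = predicted, Actual = actual)

# Overall accuracy: fraction of matching labels
mean(predicted == actual)
## [1] 0.9
```

Note that when the labels are one-hot encoded (as they are after to_categorical()), you would first convert them back to integer classes, e.g. with max.col() or k_argmax(), before tabulating.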