Overview
This tutorial provides a foundational introduction to neural network models in Keras. It gives a very brief background on the terminology and basics of what a neural network is and some of its functions. For the most part, however, this tutorial focuses on the application of basic neural networks in four examples: two regression and two classification problems. As mentioned, we shall mainly be using the Keras library in R.
References
A lot of the work follows content from the Data Science for Industry course at the University of Cape Town. Other great resources used for this tutorial are listed below.
- a great introduction to using Keras in R by DataCamp.
Prerequisites
The document is for the most part very applied in nature, and doesn’t
assume much beyond familiarity with the R statistical computing
environment. For programming purposes, it would be useful if you are
familiar with the tidyverse, or at least dplyr
specifically.
It must be stressed that this is only a starting point, a hopefully fun foray into the world of neural networks, not a definitive statement of how you should build such models. In fact, some of the methods demonstrated would likely be too rudimentary for most goals.
A Crash Course on Neural Networks
We shall begin this discussion by looking at the specific terminology used in neural networks, as well as looking at the building blocks of neural networks, including neurons, weights, and activation functions. We look at how these building blocks are used in layers to create networks, and also how these networks are trained.
Multi-Layer Perceptrons
A perceptron is a single neuron model that was a precursor to larger neural networks.
The power of neural networks comes from their ability to learn the representation in your training data and how best to relate it to the output variable you want to predict. In this sense, neural networks learn a mapping.
Mathematically, they are capable of learning any mapping function and have been proven to be universal approximators.
Neurons, Weights, and Activations
Neurons
The building blocks for neural networks are artificial neurons. These are simple computational units that have weighted input signals and produce an output signal using an activation function.
Weights
Weights on the inputs are very much like the coefficients used in a regression equation. Each neuron also has a bias, which can be thought of as an input that always has the value 1.0, and it, too, must be weighted. Weights are often initialized to small random values, such as values from 0 to 0.3, although more complex initialization schemes can be used.
The weighted inputs are summed and passed through an activation function, sometimes called a transfer function.
Activations
An activation function is a simple mapping of summed weighted input to the output of the neuron. It is called an activation function because it governs the threshold at which the neuron is activated and the strength of the output signal. Historically, simple step activation functions were used: if the summed input was above a threshold (0.5, for example), the neuron would output a value of 1.0; otherwise, it would output 0.0.
Traditionally, non-linear activation functions are used. This allows the network to combine the inputs in more complex ways and, in turn, provide a richer capability in the functions they can model.
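To make this concrete, here is a single neuron sketched in base R (illustrative only: the helper names `relu`, `sigmoid`, and `neuron`, and the example weights, are our own, not part of Keras).

```r
# A single artificial neuron in base R (illustrative sketch).
relu <- function(z) max(0, z)                 # rectified linear activation
sigmoid <- function(z) 1 / (1 + exp(-z))      # logistic activation

neuron <- function(inputs, weights, bias, activation = relu) {
  z <- sum(inputs * weights) + bias           # weighted sum plus bias
  activation(z)                               # transfer function
}

neuron(c(1, 2), weights = c(0.1, 0.3), bias = 0.2)                        # 0.9
neuron(c(1, 2), weights = c(0.1, 0.3), bias = 0.2, activation = sigmoid)
```

The weighted sum here is \(0.1 \times 1 + 0.3 \times 2 + 0.2 = 0.9\); the activation function then shapes this into the neuron's output.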
Networks of Neurons
Neurons are arranged into networks of neurons. A row of neurons is called a layer, and one network can have multiple layers. The architecture of the neurons in the network is often called the network topology.
Input Layer
The input layer takes input from your data set and passes it on to the network. Often a neural network is drawn with a visible layer containing one neuron per input value (or column) in your data set. These are not neurons as described above; they simply pass the input value through to the next layer.
Hidden Layers
Layers after the input layer are called hidden layers because they are not directly exposed to the input. The simplest network structure is to have a single neuron in the hidden layer that directly outputs the value.
Given increases in computing power and efficient libraries, very deep neural networks can be constructed. Deep learning generally refers to having many hidden layers in your neural network.
Output Layer
The final layer is called the output layer, and it is responsible for outputting a value, or vector of values, in the format required for the problem.
The choice of activation function in the output layer is strongly constrained by the type of problem that you are modeling. For example:
- A regression problem may have a single output neuron with no activation function (i.e., a linear activation function).
- A binary classification problem may have a single output neuron with a sigmoid activation function, outputting a value between 0 and 1 that represents the probability of class 1.
- A multi-class classification problem may have one output neuron per class (e.g., three neurons for the three classes in the famous iris flowers classification problem). In this case, a softmax activation function may be used to output a probability for each class.
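As a sketch of the last case, softmax can be written in a few lines of base R (the function name `softmax` is ours; Keras applies this internally when you set `activation = "softmax"`).

```r
# Softmax: turns a vector of raw scores into class probabilities.
softmax <- function(z) {
  e <- exp(z - max(z))   # subtract the max for numerical stability
  e / sum(e)
}

p <- softmax(c(2.0, 1.0, 0.1))  # raw scores for three classes
sum(p)                          # probabilities always sum to 1
which.max(p)                    # the first class has the highest score
```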
Training Networks
Once configured, the neural network needs to be trained on your data set.
Data Pre-processing
You must first prepare your data for training on a neural network.
Data must be numerical, for example, real values. If you have categorical data, such as a sex attribute with the values “male” and “female,” you can convert it to a real-valued representation called one-hot encoding. This same one-hot encoding can be used on the output variable in classification problems with more than one class.
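One-hot encoding can be sketched in base R as follows (the helper name `one_hot` is ours; later we use the `to_categorical` function that Keras provides for this).

```r
# Minimal one-hot encoder: one column per class, a single 1 per row.
one_hot <- function(x) {
  classes <- sort(unique(x))
  out <- matrix(0, nrow = length(x), ncol = length(classes),
                dimnames = list(NULL, classes))
  out[cbind(seq_along(x), match(x, classes))] <- 1
  out
}

one_hot(c("male", "female", "female", "male"))
```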
Neural networks require the input to be scaled in a consistent way. You can rescale it to the range between 0 and 1, called normalization.
Optimisers
We need to decide on a training algorithm for the network. Common ones are RMSProp, Adam, and stochastic gradient descent.
Each training example is presented to the network as input. The network processes the input upward, activating neurons as it goes, to finally produce an output value. This is called a forward pass on the network. It is the same type of pass used after the network is trained to make predictions on new data.
The output of the network is compared to the expected output, and an error is calculated. This error is then propagated back through the network, one layer at a time, and the weights are updated according to the amount they contributed to the error - using backpropagation.
The process is repeated for all of the examples in your training data. One round of updating the network for the entire training data set is called an epoch. A network may be trained for tens, hundreds, or many thousands of epochs.
Weight Updates
The weights in the network are updated from the errors calculated: \(w = w - \alpha \Delta E\)
The amount that weights are updated is controlled by a configuration parameter called the learning rate (\(\alpha\)). It is also called the step size and controls the step, or change, made to a network weight for a given error. Often small learning rates are used, such as 0.1 or 0.01 or smaller.
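A single weight update can be worked through in base R for the toy model \(\hat{y} = wx\) with squared error (illustrative only; in a real network the gradient is computed by backpropagation).

```r
# One gradient-descent step for y_hat = w * x with squared error.
alpha <- 0.1                   # learning rate (step size)
w <- 0.5                       # current weight
x <- 2; y <- 3                 # one made-up training example
y_hat <- w * x                 # forward pass: prediction = 1
grad  <- 2 * (y_hat - y) * x   # dE/dw where E = (y_hat - y)^2
w <- w - alpha * grad          # the update w = w - alpha * dE/dw
w                              # 1.3: the weight moved towards the target
```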
Prediction
Once a neural network has been trained, it can be used to make predictions.
You make predictions on testing data in order to estimate the skill of the model on unseen data. Predictions are made by providing the input to the network and performing a forward-pass, allowing it to generate an output you can use as a prediction.
Setup
First, load the required packages for this notebook: the tidyverse, plus the Keras and TensorFlow libraries. Keras is a high-level neural networks API, developed with a focus on enabling fast experimentation.
library(tidyverse)
library(keras)
library(tensorflow)
Example 1: Boston Housing Regression
Loading Data
Let’s begin by loading the data set for our problem.
boston_housing <- dataset_boston_housing()
Train and Test Sets
Next, we shall obtain our features and labels from the data.
c(train_data, train_labels) %<-% boston_housing$train
c(test_data, test_labels) %<-% boston_housing$test
Now we have our training and testing data. Let’s see their dimensions.
paste("Shape of Training set:",dim(train_data))
## [1] "Shape of Training set: 404" "Shape of Training set: 13"
paste("Shape of Testing set:",dim(test_data))
## [1] "Shape of Testing set: 102" "Shape of Testing set: 13"
Next, it is always advised to explore your data. Let’s see a summary of our features. This helps us in deciding whether scaling is required.
summary(train_data)
## V1 V2 V3 V4
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08144 1st Qu.: 0.00 1st Qu.: 5.13 1st Qu.:0.00000
## Median : 0.26888 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.74511 Mean : 11.48 Mean :11.10 Mean :0.06188
## 3rd Qu.: 3.67481 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## V5 V6 V7 V8
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4530 1st Qu.:5.875 1st Qu.: 45.48 1st Qu.: 2.077
## Median :0.5380 Median :6.199 Median : 78.50 Median : 3.142
## Mean :0.5574 Mean :6.267 Mean : 69.01 Mean : 3.740
## 3rd Qu.:0.6310 3rd Qu.:6.609 3rd Qu.: 94.10 3rd Qu.: 5.118
## Max. :0.8710 Max. :8.725 Max. :100.00 Max. :10.710
## V9 V10 V11 V12
## Min. : 1.000 Min. :188.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.23 1st Qu.:374.67
## Median : 5.000 Median :330.0 Median :19.10 Median :391.25
## Mean : 9.441 Mean :405.9 Mean :18.48 Mean :354.78
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.16
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## V13
## Min. : 1.73
## 1st Qu.: 6.89
## Median :11.39
## Mean :12.74
## 3rd Qu.:17.09
## Max. :37.97
Scaling Data
Neural networks generally work a lot better with inputs on small, consistent scales, so we scale our data. It’s often a good idea to scale the data so that all variables lie on nearly the same scale. If variables have very different ranges, a one-unit change in one variable might represent a huge change (say on a probability scale), while the same one-unit change might represent a tiny change (say if the units are metres and the distances are very large). One way to scale variables is to subtract each variable’s mean and divide by its standard deviation.
Note that the means and standard deviations should always come from the training set, even when scaling the validation and test sets. Otherwise we are using information from the test set in our model building, which we shouldn’t do. Most of the time, if observations have been randomly allocated to training and test sets, it won’t make much difference (because the variable means and standard deviations will be similar in both), but we should do the right thing.
The scale function stores means and standard deviations
as attributes of the scaled object, so we can extract these and use them
to scale variables in the validation and test data sets. We do this
below and view an updated summary.
train_data <- scale(train_data)
apply(train_data, 2, mean) # means should be 0
apply(train_data, 2, sd) # sds should be 1
attributes(train_data) # the means and sds used to scale are stored here
Note that we don’t have to scale the targets, only the input features.
summary(train_data)
## V1 V2 V3 V4
## Min. :-0.404599 Min. :-0.48302 Min. :-1.5628 Min. :-0.2565
## 1st Qu.:-0.396470 1st Qu.:-0.48302 1st Qu.:-0.8771 1st Qu.:-0.2565
## Median :-0.376186 Median :-0.48302 Median :-0.2077 Median :-0.2565
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.007608 3rd Qu.: 0.04291 3rd Qu.: 1.0271 3rd Qu.:-0.2565
## Max. : 9.223411 Max. : 3.72437 Max. : 2.4423 Max. : 3.8888
## V5 V6 V7 V8
## Min. :-1.4694 Min. :-3.81252 Min. :-2.3661 Min. :-1.2859
## 1st Qu.:-0.8897 1st Qu.:-0.55275 1st Qu.:-0.8423 1st Qu.:-0.8192
## Median :-0.1650 Median :-0.09662 Median : 0.3396 Median :-0.2945
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6279 3rd Qu.: 0.48172 3rd Qu.: 0.8980 3rd Qu.: 0.6786
## Max. : 2.6740 Max. : 3.46289 Max. : 1.1091 Max. : 3.4331
## V9 V10 V11 V12
## Min. :-0.9704 Min. :-1.3097 Min. :-2.6704 Min. :-3.7664
## 1st Qu.:-0.6255 1st Qu.:-0.7627 1st Qu.:-0.5685 1st Qu.: 0.2113
## Median :-0.5105 Median :-0.4562 Median : 0.2836 Median : 0.3875
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6738 3rd Qu.: 1.5633 3rd Qu.: 0.7835 3rd Qu.: 0.4396
## Max. : 1.6738 Max. : 1.8338 Max. : 1.6015 Max. : 0.4475
## V13
## Min. :-1.5178
## 1st Qu.:-0.8065
## Median :-0.1855
## Mean : 0.0000
## 3rd Qu.: 0.5999
## Max. : 3.4777
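The idea of reusing the stored attributes can be sketched on synthetic data (the variable names below are made up for illustration):

```r
# Scale synthetic "test" data with the training-set statistics.
set.seed(1)
train <- matrix(rnorm(20, mean = 5, sd = 2), ncol = 2)
test  <- matrix(rnorm(10, mean = 5, sd = 2), ncol = 2)

train_scaled <- scale(train)
test_scaled  <- scale(test,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

round(colMeans(train_scaled), 10)  # exactly 0 by construction
colMeans(test_scaled)              # close to, but not exactly, 0
```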
Model Building and Training
Let’s define our model. The first thing to do is call
keras_model_sequential. This allows us to build models in
which layers are stacked upon each other.
model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu",input_shape = dim(train_data)[2]) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1)
The first layer here is a fully connected layer. It has 64 units, a ReLU activation function, and its input shape is the number of columns (features) in the training data. Since we have 13 features, we have 13 inputs. For the subsequent layer (layer 2) we don’t need to define the input shape; we again specify ReLU activation and 64 units. In the final layer, we specify 1 unit and make no specification for the activation function; the default activation function is the linear function.
Note that in the first layer it is typically not advised to use fewer units than inputs (13 here). Also, with respect to the number of units, people tend to use powers of 2.
Next we can compile the model. Here we need to tell the model what loss function to use, what optimizer to use, and what metric we want. We decided to use the mean squared error: \(\frac{1}{n}\sum(\hat{y}-y)^2\). With respect to optimisers, popular ones include stochastic gradient descent, Adam, and RMSProp. What is nice about optimisers like RMSProp and Adam is that we do not have to specify the learning rate - they will use a default value and then adapt it. The metric is just for us; the model doesn’t use it at all for training and fitting. The mean absolute error is given by \(\frac{1}{n}\sum|\hat{y}-y|\).
model %>% compile(
loss = "mse",
optimizer = optimizer_rmsprop(),
metrics = list("mean_absolute_error")
)
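As a quick check of what these choices mean, the loss and metric can be computed by hand in base R on a few made-up numbers:

```r
# Mean squared error (the loss) and mean absolute error (the metric).
y     <- c(21.0, 18.5, 30.2)   # true values (made up)
y_hat <- c(20.0, 19.5, 28.2)   # predictions (made up)

mse <- mean((y_hat - y)^2)     # (1 + 1 + 4) / 3 = 2
mae <- mean(abs(y_hat - y))    # (1 + 1 + 2) / 3 = 1.33
c(mse = mse, mae = mae)
```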
Now we are set for training. We call the model, model in
this case, and the fit function. We need to specify the
training data, the training labels, and the epochs (how many times the
network will see the data). We also specify the batch size, which tells
the network to read 5 examples at a time when computing the weight
updates. The weight updates are specified as
\(w = w-\alpha\Delta E\). That is, the
new weight is the old weight minus the learning rate times the change
in error (the partial derivative computed through backpropagation). We also
set shuffle to TRUE, since sometimes there may be some
intrinsic order in the data that we don’t want.
history <- model %>% fit(
train_data, train_labels,
epochs = 50, batch_size = 5,
validation_split = 0.2, shuffle = TRUE
)
Training Performance
When calling the fit function, Keras provides feedback of what happens to the loss during training. This is useful in determining if the model was over-fitting for example. We plot the training performance below.
plot(history)
Initially, both the training and validation losses are quite high; as the epochs increase, these values decrease. Training loss stays below validation loss and seems to stay that way, which is expected. If the validation loss started rising while the training loss kept falling, that would indicate overfitting.
Predictions
Having done this, we can proceed with making predictions on the testing data - unseen data. Since we trained the model on scaled data, we must scale the test data too, using the training-set statistics (not the test set’s own means and sds).
# scale the test data with the training-set means and sds
test_predictions <- model %>% predict(
  scale(test_data,
        center = colMeans(boston_housing$train$x),
        scale = apply(boston_housing$train$x, 2, sd)))
Let’s take a peek at the predictions and the correct values.
round(test_predictions[ , 1][0:15],2)
## [1] 6.46 19.99 22.43 26.51 25.41 22.69 28.37 21.96 20.02 19.14 19.89 17.39
## [13] 15.01 42.61 16.24
test_labels[0:15]
## [1] 7.2 18.8 19.0 27.0 22.2 24.5 31.2 22.9 20.5 23.2 18.6 14.5 17.8 50.0 20.8
Now we introduce something new: a callback that lets us know when an epoch has completed. In this case, the epoch number is printed on each even-numbered epoch. This is essentially just to show the progress of training.
print_dot_callback <- callback_lambda(
on_epoch_end = function(epoch, logs) {
if (epoch %% 2 == 0) cat(epoch, '\n')
}
)
Note that we could also add early stopping (via callback_early_stopping), in which case the full 100 epochs might not be performed; here all 100 epochs run.
history <- model %>% fit(
train_data,
train_labels,
epochs = 100,
validation_split = 0.2,
verbose = 0,
callbacks = list(print_dot_callback))
## 0
## 2
## 4
## ...
## 96
## 98
Let’s plot the training performance.
plot(history)
Evaluations
Finally, we shall evaluate our model. This tells us the loss and mean absolute error - the metrics we specified we wanted. The evaluate function essentially does the prediction step for us.
# again scale the test data with the training-set statistics
model %>% evaluate(
  scale(test_data,
        center = colMeans(boston_housing$train$x),
        scale = apply(boston_housing$train$x, 2, sd)),
  test_labels, verbose = 0)
## loss mean_absolute_error
## 16.864580 2.803717
Example 2: Iris Classification
Load Data
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)
Let’s return the first part of iris.
head(iris)
## V1 V2 V3 V4 V5
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
We have four input features. And we have the class shown in the far right column. We have three types of classes and 150 examples. This is shown below in the structure.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ V1: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ V2: num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ V3: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ V4: num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ V5: chr "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
Next up, dimensions. It’s key to look at the data before we do any pre-processing and modelling. We then assign variable names to the features, to make the data easier to interpret. Finally, we show the cleaned-up data frame.
dim(iris)
## [1] 150 5
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
iris <- as.data.frame(iris)
iris %>% head(5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
Visualise Data
Let’s view the data by plotting it out.
plot(iris$Petal.Length,
iris$Petal.Width,
pch=21, bg=c("red","green3","blue")[factor(iris$Species)], # factor() maps each species to a colour index
xlab="Petal Length",
ylab="Petal Width")
We see there is some overlap between the classes.
We need to convert the target values into something more suitable.
Currently they are strings; they need to be numbers. We subtract
1 because we want the classes to start from 0.
numerical_target <- factor(iris[,5])
iris[,5] <- as.numeric(numerical_target) -1
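What this conversion does can be seen on a tiny made-up vector of labels:

```r
# factor() assigns integer codes 1, 2, 3 in alphabetical order of the
# levels; subtracting 1 gives classes 0, 1, 2 as required.
labels <- c("Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica")
as.numeric(factor(labels)) - 1   # 0 1 0 2
```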
We shall turn iris into a matrix and check the
dimensions.
iris <- as.matrix(iris)
dim(iris)
## [1] 150 5
Scaling Data
Standardize the iris features using the scale
function.
iris_features <- scale(iris[,1:4])
iris_target <- iris[,5]
Now let’s return the summary of iris features.
summary(iris_features)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :-1.86378 Min. :-2.4308 Min. :-1.5635 Min. :-1.4396
## 1st Qu.:-0.89767 1st Qu.:-0.5858 1st Qu.:-1.2234 1st Qu.:-1.1776
## Median :-0.05233 Median :-0.1245 Median : 0.3351 Median : 0.1328
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67225 3rd Qu.: 0.5674 3rd Qu.: 0.7602 3rd Qu.: 0.7880
## Max. : 2.48370 Max. : 3.1043 Max. : 1.7804 Max. : 1.7052
We see the data ranges are now a lot more consistent. Next we return the
summary of the iris target. This just confirms that the values
are 0, 1 and 2.
summary(iris_target)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 1 1 2 2
Train and Test Data
We split our data into training and testing data. We used a 67/33 split. So our training data has 67% of the data observations and the testing set has the remaining 33%.
# Determine sample size
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))
# Split the `iris` data
x_train <- iris_features[ind==1, 1:4]
x_test <- iris_features[ind==2, 1:4]
# Split the class attribute
y_train <- iris_target[ind==1]
y_test <- iris_target[ind==2]
One-Hot Encoding
Since we want our network output to give class probabilities, we need to one-hot encode the original target. As such, here we shall convert the targets/labels to their one-hot encoded equivalents.
y_train <- to_categorical(y_train)
y_test_original = y_test
y_test <- to_categorical(y_test)
Model Building
We check the dimensions of the targets. This helps when building and specifying the model.
dim(y_train)
## [1] 112 3
dim(y_test)
## [1] 38 3
Let’s build the model. We use the sequential model, so we are essentially just stacking layers. We add fully connected layers, also known as dense layers. Note that it’s always good to keep checking the API on the Keras website for details about the default values and arguments of these layers. For a dense layer, the default activation function is the linear activation function.
Again, for the first layer we have to specify the input shape of the
data. The dropout layer, layer_dropout, randomly sets input
units to 0 with a frequency of rate at each step during training,
which helps prevent overfitting. Here rate is a float between
0 and 1 and represents the fraction of the input units to drop.
model <- keras_model_sequential()
model %>%
layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 3, activation = 'softmax')
Note that a high dropout rate means we are probably dropping a lot of connections. We need to specify 3 units in the output layer, since we have 3 classes. Since this is a multi-class (not binary) classification problem, we choose softmax rather than sigmoid.
We can print out a summary of the network architecture. This
essentially gives us a high-level overview of each layer and its
output shape. Param # gives the number of parameters
trained in each layer (\(\text{weights} +
\text{biases}\)). The total number of parameters is shown at the end,
and equals the number of trainable plus non-trainable
parameters.
summary(model)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_4 (Dense) (None, 8) 40
## dropout (Dropout) (None, 8) 0
## dense_3 (Dense) (None, 3) 27
## ================================================================================
## Total params: 67
## Trainable params: 67
## Non-trainable params: 0
## ________________________________________________________________________________
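These parameter counts can be verified by hand: a dense layer with \(n_{in}\) inputs and \(n_{out}\) units has \(n_{in} \times n_{out}\) weights plus \(n_{out}\) biases (the helper name `dense_params` is ours).

```r
# Parameters in a dense layer = weights + biases.
dense_params <- function(n_in, n_out) n_in * n_out + n_out

dense_params(4, 8)                       # first layer: 40
dense_params(8, 3)                       # output layer: 27
dense_params(4, 8) + dense_params(8, 3)  # total: 67 (dropout adds none)
```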
We shall now compile our model, providing the extra information needed to train it: the loss function, the optimiser, and what metric to display to the user. We are in a classification problem, so categorical cross-entropy is appropriate. We choose the Adam optimiser here and specify a learning rate of 0.01.
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(learning_rate = 0.01),
metrics = c('accuracy')
)
Let’s get to training the neural network. We call the fit function. Again we specify the epochs, validation split, and batch size. We also need to tell the network which is the training data and which are the labels. Remember, the batch size is the amount of data loaded into memory per weight update. So here the network will read 5 examples, then another 5, and so on until it has seen all the data - that represents one epoch.
history <- model %>% fit(
x_train, y_train,
epochs = 300, batch_size = 5,
validation_split = 0.2, shuffle = TRUE, verbose = 0
)
Training Performance
When calling the fit function, Keras provides feedback of what happens to the loss during training. This is useful in determining if the model was over-fitting for example.
plot(history)
Evaluate the Performance
model %>% evaluate(x_test, y_test)
## loss accuracy
## 0.08418372 0.94736844
We get a good, high accuracy of about 95%.
We can also get the confusion matrix, since it is a classification problem.
# update to video: predict_classes deprecated from tensorflow >= v2.6
Y_test_hat <- model %>% predict(x_test) %>% k_argmax() %>% as.numeric()
table(y_test_original, Y_test_hat)
## Y_test_hat
## y_test_original 0 1 2
## 0 12 0 0
## 1 0 16 0
## 2 0 2 8
A confusion matrix is a table used to visualize and summarize the performance of a classification algorithm: each row shows a true class and each column a predicted class.
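Overall accuracy can be recovered directly from the confusion matrix, since the correct predictions sit on the diagonal (the matrix below copies the table printed above):

```r
# Accuracy = correct predictions / all predictions.
cm <- matrix(c(12,  0, 0,
                0, 16, 0,
                0,  2, 8), nrow = 3, byrow = TRUE)
sum(diag(cm)) / sum(cm)   # 36 / 38, matching the evaluated accuracy
```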
Example 3: Car Sales Regression
Load Data
We start by reading in our data as a csv file. We also quickly view the first 10 rows of the data.
sales <- read.csv(url("https://raw.githubusercontent.com/MGCodesandStats/datasets/master/cars.csv"), header = TRUE)
head(sales, 10)
## age gender miles debt income sales
## 1 28 0 23 0 4099 620
## 2 26 0 27 0 2677 1792
## 3 30 1 58 41576 6215 27754
## 4 26 1 25 43172 7626 28256
## 5 20 1 17 6979 8071 4438
## 6 58 1 18 0 1262 2102
## 7 44 1 17 418 7017 8520
## 8 39 1 28 0 3282 500
## 9 44 0 24 48724 9980 22997
## 10 46 1 46 57827 8163 26517
Note that sales is the target variable. So in this problem we are trying to predict car sales using the other 5 features.
str(sales)
## 'data.frame': 963 obs. of 6 variables:
## $ age : int 28 26 30 26 20 58 44 39 44 46 ...
## $ gender: int 0 0 1 1 1 1 1 1 0 1 ...
## $ miles : int 23 27 58 25 17 18 17 28 24 46 ...
## $ debt : int 0 0 41576 43172 6979 0 418 0 48724 57827 ...
## $ income: int 4099 2677 6215 7626 8071 1262 7017 3282 9980 8163 ...
## $ sales : int 620 1792 27754 28256 4438 2102 8520 500 22997 26517 ...
Given the small size of the data, we know we shouldn’t create a very complex network model. Let’s check a summary of the data.
summary(sales)
## age gender miles debt income
## Min. :19.00 Min. :0.000 Min. :10.0 Min. : 0 Min. : 0
## 1st Qu.:27.00 1st Qu.:0.000 1st Qu.:20.0 1st Qu.: 1475 1st Qu.: 3506
## Median :37.00 Median :1.000 Median :25.0 Median : 6236 Median : 6360
## Mean :37.97 Mean :0.513 Mean :27.7 Mean :14109 Mean : 6176
## 3rd Qu.:49.00 3rd Qu.:1.000 3rd Qu.:32.0 3rd Qu.:16686 3rd Qu.: 8650
## Max. :60.00 Max. :1.000 Max. :97.0 Max. :59770 Max. :11970
## sales
## Min. : 500
## 1st Qu.: 3554
## Median : 9130
## Mean :11690
## 3rd Qu.:19245
## Max. :29926
We have some very small and large numbers. So it is suggested to scale our features.
Scale Data
In feature based problems, we often need to do scaling.
# Max-Min Normalization
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
maxmindf <- as.data.frame(lapply(sales, normalize))
head(maxmindf, 10)
## age gender miles debt income sales
## 1 0.21951220 0 0.14942529 0.000000000 0.3424394 0.004078026
## 2 0.17073171 0 0.19540230 0.000000000 0.2236424 0.043906749
## 3 0.26829268 1 0.55172414 0.695599799 0.5192147 0.926187725
## 4 0.17073171 1 0.17241379 0.722302158 0.6370927 0.943247468
## 5 0.02439024 1 0.08045977 0.116764263 0.6742690 0.133827228
## 6 0.95121951 1 0.09195402 0.000000000 0.1054302 0.054441650
## 7 0.60975610 1 0.08045977 0.006993475 0.5862155 0.272548087
## 8 0.48780488 1 0.20689655 0.000000000 0.2741855 0.000000000
## 9 0.60975610 0 0.16091954 0.815191568 0.8337510 0.764527968
## 10 0.65853659 1 0.41379310 0.967492053 0.6819549 0.884150071
After scaling, we see the values are in similar ranges. The neural network will probably do a better job with such data.
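A quick sanity check of the min-max normalize helper (repeated here so the snippet is self-contained):

```r
# Min-max normalization maps the smallest value to 0 and the largest to 1.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

normalize(c(0, 5, 10))       # 0.0 0.5 1.0
range(normalize(rnorm(50)))  # always within [0, 1]
```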
View and Cleaning
We shall remove the column names from the data frame and convert it into a matrix. This is general pre-processing.
names(maxmindf) <- NULL
sales = data.matrix(maxmindf)
It’s always good to know the shape of the data.
dim(sales)
## [1] 963 6
Features, Labels, Train and Test Sets
Now we shall split the data into features and targets.
sales_features <- sales[,1:5]
sales_target <- sales[,6]
Now we can generate our training and testing data sets. We used a 70/30 split.
# Determine sample size
ind <- sample(2, nrow(sales), replace=TRUE, prob=c(0.70, 0.30))
# Split the data
x_train <- sales_features[ind==1, 1:4] # note: only the first 4 features are kept here, so income is dropped
x_test <- sales_features[ind==2, 1:4]
# Split the class attribute
y_train <- sales_target[ind==1]
y_test <- sales_target[ind==2]
Note, it’s common practice to end up with four variables:
x_train and y_train, x_test and y_test.
Model Building
Let’s conduct some model building, where we can generate our layers and specify inputs, units etc.
dim(x_train)[2]
## [1] 4
Now, using our knowledge of the input shape, we can specify it in the input layer. We should always start with a simple model, as shown below.
simple_model <- keras_model_sequential() %>%
layer_dense(units = 4, activation = "relu",input_shape = dim(x_train)[2]) %>%
layer_dense(units = 1)
Then we can make more complex models - shown below is an example of this.
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu",input_shape = dim(x_train)[2]) %>%
layer_dense(units = 8, activation = "relu") %>%
layer_dense(units = 4,activation = "relu") %>%
layer_dense(units = 1)
For the first layer, we have 16 units. We have 4 inputs, so typically we shouldn’t have fewer units than inputs. We have two hidden layers, with 8 and 4 units respectively; both use a ReLU activation function. The output layer uses the (default) linear activation function.
Now we can compile our model.
simple_model %>% compile(
loss = "mse",
optimizer = 'adam',
metrics = list("mean_absolute_error")
)
We set the epochs to 220 initially. Our validation split is smaller than usual, only 10% of the training set - since our data is relatively small.
history <- simple_model %>% fit(
x_train, y_train,
epochs = 220, batch_size = 50,
validation_split = 0.1, shuffle = TRUE
)
Training Performance
plot(history)
At this stage we usually go back and tweak our model, compile it and train again. This process is repeated until satisfactory results are achieved - if possible.
Evaluations
simple_model %>% evaluate(x_test, y_test)
## loss mean_absolute_error
## 0.01918746 0.11168385
Example 4: MNIST Classification
MNIST is a data set of handwritten digits. For this example, instead of treating it as an image-based classification problem, we shall treat it as a feature-based classification problem. The data set is also available in the Keras library.
In this example we follow very similar steps to the previous examples, so we do not provide many annotations or comments unless deemed necessary.
Load dataset
Get data and features, as well as targets.
mnist <- dataset_mnist()
The data set contains both the training and testing data.
str(mnist)
## List of 2
## $ train:List of 2
## ..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
## $ test :List of 2
## ..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
We shall then set these corresponding data into our four different
variables. So then we end up with x_train,
x_test, y_train and y_test.
x_train <- mnist$train$x
x_test <- mnist$test$x
y_train <- mnist$train$y
y_test <- mnist$test$y
View Data
paste("Shape of Training Data:", dim(x_train))
## [1] "Shape of Training Data: 60000" "Shape of Training Data: 28"
## [3] "Shape of Training Data: 28"
paste("Shape of Testing Data:", dim(x_test))
## [1] "Shape of Testing Data: 10000" "Shape of Testing Data: 28"
## [3] "Shape of Testing Data: 28"
head(x_test)
head(x_train)
We will likely have to apply one-hot encoding to the targets.
Reshape the Data
We reshape the values to convert the image into a vector.
Each image is originally \(28\times28\times1\) (the last \(\times1\) is because the image is greyscale). So each \(28\times28\) image can be converted into a vector of length 784 (\(28\times28 = 784\)).
So the original training data was (60 000, 28, 28). We reshape it to (60 000, 784).
c(nrow(x_test))
## [1] 10000
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
Just to confirm.
paste("Shape of Training Data:", dim(x_train))
## [1] "Shape of Training Data: 60000" "Shape of Training Data: 784"
paste("Shape of Testing Data:", dim(x_test))
## [1] "Shape of Testing Data: 10000" "Shape of Testing Data: 784"
Rescale Data
dim(x_train)
## [1] 60000 784
dim(x_test)
## [1] 10000 784
x_train <- x_train / 255
x_test <- x_test / 255
One-Hot Encoding
y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)
dim(y_train)
## [1] 60000 10
Create Model
model <- keras_model_sequential()
We can add a number of layers and activation functions to the model.
We start off by adding model %>% then append each new
layer (with its activation function) on a new line.
The first layer has one extra thing which the others do not have. For
the first layer we specify the input shape, which denotes the shape of
the data input. In this case the shape of the input is just a vector of
length 784, and thus we add input_shape = c(784).
First we start off with a simple model with two layers, then we will add more complexity to the model.
In each case the line
model <- keras_model_sequential() is repeated; otherwise we
would keep adding layers to the first instance of the variable
model and end up with one massive model.
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dense(units = 10, activation = 'softmax')
Next, we add an extra layer so the model can learn more complexities.
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dense(units = 10, activation = 'softmax')
Next, we add dropout to the first hidden layer.
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dense(units = 10, activation = 'softmax')
Finally, we add dropout to the next hidden layer.
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 10, activation = 'softmax')
Print out a summary of the network architecture.
summary(model)
## Model: "sequential_8"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_21 (Dense) (None, 256) 200960
## dropout_3 (Dropout) (None, 256) 0
## dense_20 (Dense) (None, 128) 32896
## dropout_2 (Dropout) (None, 128) 0
## dense_19 (Dense) (None, 10) 1290
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________
In total, we are fitting over 200 000 weights on our training examples.
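The parameter counts in the summary follow directly from the layer sizes: each dense layer has (inputs × units) weights plus one bias per unit. A quick check in R:

```r
# weights + biases for each dense layer: inputs * units + units
(784 * 256 + 256) +   # dense_21: 200960
(256 * 128 + 128) +   # dense_20: 32896
(128 * 10  + 10)      # dense_19: 1290
## [1] 235146
```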
Compile & Train Model
We need to provide extra information to train the model. We need to specify the loss function, the optimiser and what metric to display to the user.
We use categorical cross-entropy: \(-\sum_{i} y_i \ln(\hat{y}_i)\)
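As a quick sanity check, we can evaluate this by hand for a single one-hot target (the values here are made up for illustration, not model output):

```r
y_true <- c(0, 0, 1)        # one-hot target: class 3
y_hat  <- c(0.1, 0.2, 0.7)  # predicted class probabilities
-sum(y_true * log(y_hat))   # = -log(0.7), approximately 0.357
```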
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
history <- model %>% fit(
x_train, y_train,
epochs = 10, batch_size = 128,
validation_split = 0.2
)
Training Performance
When calling the fit function, Keras provides feedback of what happens to the loss during training. This is useful in determining if the model was over-fitting for example.
plot(history)
Evaluations
Evaluate the performance on the test data.
model %>% evaluate(x_test, y_test)
## loss accuracy
## 0.07894263 0.97810000
Prediction
To predict we can either predict on an entire matrix or on a subset. In the case of a subset, you need to make sure that the correct dimensions are used as the network has certain input expectations. In this case, the model expects data in this format: [batches, 784]. So you can send any number of batches of data to the network.
dim(x_test)
## [1] 10000 784
Here we want to predict on the first 10 test examples. Subsetting with
x_test[1:10, ] keeps the required [10, 784] shape. (Note that flattening
with array() and then reshaping would scramble the pixels, since R
flattens column-major while array_reshape fills row-major.)
subset <- x_test[1:10, , drop = FALSE]
Here we check the dimensions.
dim(subset)
## [1] 10 784
Finally, we can predict on the first 10 examples.
# update to video: predict_classes deprecated from tensorflow >= v2.6
model %>% predict(subset) %>% k_argmax() %>% as.numeric()
## [1] 7 2 1 0 4 1 4 9 6 9
Here we predict on all of the x_test data.
model %>% predict(x_test) %>% k_argmax() %>% as.numeric() %>% head(10)
## [1] 7 2 1 0 4 1 4 9 6 9
Summary
This was a short introduction to deep learning in R with Keras. We looked at four different examples: two regression problems and two classification problems. We covered the basics of exploring and preprocessing the data. After this, we looked at constructing a deep learning model; in these cases, we built Multi-Layer Perceptrons (MLPs) for multi-class classification and for regression, predicting a numerical output. We then looked at how to compile and fit the model to the data, and how to visualise the training history. Finally, we looked at evaluating the model by predicting target values on test data.
Data Loading
Data Pre-processing
Before you can build your model, you also need to make sure that your
data is cleaned, normalized (if applicable) and divided into training
and test sets. At first sight, when you inspected the data with
head(), you didn’t really see anything out of the ordinary,
right? Let’s make use of summary() and str().
We want to work with features which have values in small ranges,
preferably between 0 and 1.
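As a sketch, assuming the features sit in a numeric data frame df (a hypothetical name), min-max scaling each column to [0, 1] might look like:

```r
# min-max scale every column of a numeric data frame df to [0, 1]
scaled <- apply(df, 2, function(x) (x - min(x)) / (max(x) - min(x)))
```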
Data Splits
After we have checked the quality of the data and know whether or not it needs to be scaled, we can split it into training and test sets so that we are ready to start building the model. By doing this, we ensure that we can make honest assessments of the performance of our model afterwards.
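A minimal sketch of a random 80/20 split, assuming the full data set is in a data frame called data (a hypothetical name):

```r
set.seed(123)  # for reproducibility
train_idx <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train <- data[train_idx, ]
test  <- data[-train_idx, ]
```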
One-Hot Encoding (Classification Problems)
When you want to model multi-class classification problems with
neural networks, it is generally good practice to transform your target
attribute from a vector of class values into a matrix with a Boolean
indicator for each class, marking whether a given instance has that
class value. The to_categorical() function in Keras performs this
one-hot encoding.
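For example, three integer labels become a matrix with one indicator column per class (columns correspond to classes 0, 1 and 2):

```r
library(keras)
y <- c(0, 2, 1)
to_categorical(y, num_classes = 3)
# row 1 -> 1 0 0, row 2 -> 0 0 1, row 3 -> 0 1 0
```

Each row contains a single 1, in the column of that instance's class.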
Construct, Compile and Train Model
To start constructing a model, you should first initialize a
sequential model with the help of the
keras_model_sequential() function. Then, you’re ready to
start modeling. We can then add layers, specify units in each layer, and
activation functions.
Once we have set up the architecture of the model, it's time to
compile and fit it to the data. To compile the model, you configure it
with an optimiser. We looked at Adam and RMSprop; there are of course
others, with Stochastic Gradient Descent (SGD), Adam and RMSprop among
the most popular. Depending on which algorithm you choose, you'll need
to tune certain parameters, such as the learning rate or momentum. We
must also specify a loss function, and the choice depends on the task
at hand: for regression we use mse and for classification
we use categorical_crossentropy. Additionally, we must
provide a metric to assess the model whilst training:
accuracy is appropriate for classification problems, and
mean_absolute_error is good for regression problems.
Assessing Training Performance
Using the plot function, we can visualise the performance of the training process for the specified network. If your training data accuracy keeps improving while your validation data accuracy gets worse, you are probably overfitting: your model starts to just memorize the data instead of learning from it. If the trend for accuracy on both data sets is still rising for the last few epochs, you can clearly see that the model has not yet over-learned the training data set.
Fine Tuning Model
Fine-tuning your model is probably something that you’ll be doing a lot, especially in the beginning. There are already two key decisions that you’ll probably want to adjust: how many layers you’re going to use and how many “hidden units” you will choose for each layer. Besides playing around with the number of epochs or the batch size, there are other ways in which you can tweak your model in the hopes that it will perform better: by adding layers, by increasing the number of hidden units, and by passing your own optimisation parameters to the compile() function.
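As a sketch of passing your own optimisation parameters, one might lower the learning rate of RMSprop (learning_rate is the argument name in recent versions of the Keras R package; older versions used lr):

```r
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(learning_rate = 0.0005),  # slower, more stable updates
  metrics = c('accuracy')
)
```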
Predictions
Now that your model is created, compiled and has been fitted to the
data, it’s time to actually use your model to predict the labels or
target values for your test set. As you might have expected, you can use
the predict() function to do this.
For classification problems, you can print out the confusion matrix to check out the predictions and the real labels.
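Using the same prediction pipeline as in the MNIST example above, a confusion matrix can be built with base R's table() by comparing predicted and actual class labels:

```r
pred   <- model %>% predict(x_test) %>% k_argmax() %>% as.numeric()
actual <- mnist$test$y  # the original (non-one-hot) labels
table(Predicted = pred, Actual = actual)
```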