9 Advanced Deep-Learning Best Practices
9.1 Going beyond the sequential model: the Keras functino API
So far we have only applied the keras_model_sequential
to build the network.
In this section we cover how we can handle different inputs at the same time. They argue that just making several models and then just balancing out the results is too naive and it handles redundant information. Hence we see how different types of information can be injected into the same network.
We are for instance going to use the functional API.
9.1.1 Introduction to the functional API
Basically we can set our own input and output layers and then assemble them in our own model.
Here is a small example.
library(keras)
#The traditional approach
<- keras_model_sequential() %>% #Sequential model that we have used before
seq_model layer_dense(units = 32, activation = "relu", input_shape = c(64)) %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 10, activation = "softmax")
#The manual API approach
<- layer_input(shape = c(64))
input_tensor
<- input_tensor %>%
output_tensor layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 10, activation = "softmax")
<- keras_model(input_tensor, output_tensor) #This turns an input and output tensor into a model
model summary(model)
This is basically another approach to construct the model. Hence a manual way to construct what we did with keras_model_sequantial
.
But why do it?
- Here we are able to specify several inputs and outputs, hence also cmobining different types, e.g., images and texts.
An example:
- We see that there are key words, e.g., carpenter offer = a cheap house that should be renovated.
- If there are pictures of a view, e.g., a like etc.
- Hence one may not only use facts and metadata, but use the descriptions and images for the houses to capture that imformation in the model.
Then how to go about it?
- e.g., if we have boston housing data, we see that the metadata, sales description and images may be explaining some of the same things, hence if you build three different models you will include a lot of redundancy in the model. Also interactions between the data types will be ignored if you have three different models. This can be overcome by having all data types included in the same area. Such a model could look like this:
Hence we see that the data types and different approaches are put into the same model and then we use back propagonation to tune the parameters.
It works. But there is an example in the book where they just write
<- layer_input(shape = c(64))
unrelated_input <- keras_model(unrelated_input, output_tensor) bad_model
I don’t see what is different.
Now we can compile it
%>% compile(
model optimizer = "rmsprop",
loss = "categorical_crossentropy"
)<- array(runif(1000 * 64), dim = c(1000, 64)) #Just random data
x_train <- array(runif(1000 * 10), dim = c(1000, 10))
y_train
%>% fit(x_train, y_train, epochs = 10, batch_size = 128) #Train tha model
model %>% evaluate(x_train, y_train) #Evaluate the model model
9.1.2 Multi-input models
Now imagine that we want to run several inputs, for instance wit text sequences, then we can actually run the text sequences and then on a later point concatenate them before our densely connected layers. Such a network would look like this:

Figure 9.1: A model with several inputs
Hence we see that we have two separate branches that are being joined before the densely connected layers. Let us now look at how such a model is built and run.
#Listing 7.1. Functional API implementation of a two-input question-answering model
library(keras)
<- 10000 #Amount of words
text_vocabulary_size <- 10000 #Amount of words
ques_vocabulary_size <- 500 #The amount of words in our answer
answer_vocabulary_size
#The first input layer
<- layer_input(shape = list(NULL), #Text input is a variable-length
text_input dtype = "int32",
name = "text")
<- text_input %>%
encoded_text layer_embedding(input_dim = text_vocabulary_size + 1,
output_dim = 32) %>%
layer_lstm(units = 32)
#The second input layer
<- layer_input(shape = list(NULL),
question_input dtype = "int32", name = "question")
<- question_input %>%
encoded_question layer_embedding(input_dim = 32, output_dim = ques_vocabulary_size) %>%
layer_lstm(units = 16)
#Now we want to concat the layers
<- layer_concatenate(list(encoded_text, encoded_question))
concatenated
#We add a densely connected layer ontop with softmax activation
<- concatenated %>%
answer layer_dense(units = answer_vocabulary_size,
activation = "softmax")
#Assembling the model
<- keras_model(list(text_input, question_input), answer)
model
#Compiling
%>% compile(
model optimizer = "rmsprop",
loss = "categorical_crossentropy",
metrics = c("acc")
)
Notice that our answer vocabulary does only contain 500 words. They do not comment on it
<- 1000
num_samples <- 100
max_length
#Create dummy data
<- function(range, nrow, ncol) {
random_matrix matrix(sample(range, size = nrow * ncol, replace = TRUE),
nrow = nrow, ncol = ncol)
}<- random_matrix(1:text_vocabulary_size, num_samples, max_length)
text <- random_matrix(1:ques_vocabulary_size, num_samples, max_length)
question
#One hot encoding answers
<- random_matrix(0:1, num_samples, answer_vocabulary_size)
answers
#Fitting using a list of inputs
%>% fit(
model list(text, question), answers,
epochs = 10, batch_size = 128
)
#Fitting using a named list, this is basically the same
%>% fit(
model list(text = text, question = question),
answers,epochs = 10, batch_size = 128
)
For some reason this is not working
9.1.3 Multi-output
Instead of having multiple inputs we can also have multiple outputs. That would look like this:

Figure 9.2: A model with several outputs
Often this is applied for companies to predict unknown variables, e.g., age, gender and so on, thus the information that you dont give, so they can make the recommendations better.
So we aim for a model that is able to predict several outputs, for instance we have some SoMe data and we want to predict the age, income and gender of the person that we are getting information on.
#Listing 7.3. Functional API implementation of a three-output model
library(keras)
<- 50000
vocabulary_size <- 10
num_income_groups
<- layer_input(shape = list(NULL),
posts_input dtype = "int32", name = "posts")
<- posts_input %>%
embedded_posts layer_embedding(input_dim = 256, output_dim = vocabulary_size)
<- embedded_posts %>%
base_model layer_conv_1d(filters = 128, kernel_size = 5, activation = "relu") %>%
layer_max_pooling_1d(pool_size = 5) %>%
layer_conv_1d(filters = 256, kernel_size = 5, activation = "relu") %>%
layer_conv_1d(filters = 256, kernel_size = 5, activation = "relu") %>%
layer_max_pooling_1d(pool_size = 5) %>%
layer_conv_1d(filters = 256, kernel_size = 5, activation = "relu") %>%
layer_conv_1d(filters = 256, kernel_size = 5, activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 128, activation = "relu")
#Creating and naming the output layers
<- base_model %>%
age_prediction layer_dense(units = 1
name = "age") #Notice that w ename it appropriately
,#Notice that activation is just be default linear, continous output
<- base_model %>%
income_prediction layer_dense(units = num_income_groups #One for each income group
activation = "softmax" #We want distributed probabilities
, name = "income")
,
<- base_model %>%
gender_prediction layer_dense(units = 1 #Binary output
activation = "sigmoid" #We want probabilities for male vs. female
, name = "gender")
,
#Creating the model
<- keras_model(
model #The input
posts_input, list(age_prediction, income_prediction, gender_prediction) #The output layers
)summary(model)
Now we see that we have built a model with almost 46.000.000 parameters and have several outputs.
Notice that we have three different output types, namely continous, categorical and binary. Hence we also want to apply three different loss functions and also three different accuracy measurements.
#Listing 7.4. Compilation options of a multi-output model: multiple losses
#Notice that we named the output layes, hence we can call these
%>% compile(
model optimizer = "rmsprop",
loss = list(
age = "mse",
income = "categorical_crossentropy",
gender = "binary_crossentropy"
),
#Balancing out the loss functions
loss_weights = list(
age = 0.25,
income = 1,
gender = 10
) )
An alternative could be to specify loss = c("mse", "categorical_crossentropy", "binary_crossentropy")
although that is not as easy to read.
We see that the different loss functions penalize differently, and in addition, we are also working on different scales. For instance we see that gender can only be two outcomes, hence you can only that far off, where the age for instance, you can be 75 but e.g., predicted to be 10 years old, and working with squared errors one will be able to be very far off, as it penalize such.
We are going to solve this with adding wheigts to the loss functions, see the code above.
Notice that we also call the outputs different heads of the network.
Now we could train the model.
Notice that we cannot train the model, as we dont have the input and the target values, but if we had, we could assign them in its respective list element.
%>% fit(
model #The input
posts, list( #The output
age = age_targets, #Notice that we use a named list
income = income_targets,
gender = gender_targets
),epochs = 10, batch_size = 64
)
Notice that we could also have encoded list(age_targets, income_targets, gender_targets)
, hence without using the naming.
9.1.4 Directed acyclic graphs of layers (DAG)
Now we have looked at multiple inputs and multiple outputs. We see that the hidden layers can also have a more complicated constellation. We are going to look at:
- Inception modules
- Residual connections
9.1.4.1 Inception modules
Here they apply different convolutions and then concatenates it before pushing the information into the densely connected layers. It could look like the following:

Figure 9.3: A model with several inputs
Their premise is that instead of going deep they go wide. As traditional CNN can be very deep, very cumbersome and also prone to overfitting.
In R it could look like this:
Things to notice:
- All output sizes must be the same
library(keras)
<- input %>%
branch_a layer_conv_2d(filters = 128, kernel_size = 1,
activation = "relu", strides = 2)
<- input %>%
branch_b layer_conv_2d(filters = 128, kernel_size = 1,
activation = "relu") %>%
layer_conv_2d(filters = 128, kernel_size = 3,
activation = "relu", strides = 2)
<- input %>%
branch_c layer_average_pooling_2d(pool_size = 3, strides = 2) %>%
layer_conv_2d(filters = 128, kernel_size = 3,
activation = "relu")
<- input %>%
branch_d layer_conv_2d(filters = 128, kernel_size = 1,
activation = "relu") %>%
layer_conv_2d(filters = 128, kernel_size = 3,
activation = "relu") %>%
layer_conv_2d(filters = 128, kernel_size = 3,
activation = "relu", strides = 2)
#Conatenate the layres
<- layer_concatenate(list(
output
branch_a, branch_b, branch_c, branch_d ))
Notice that we can load a fully trained model somewaht similar to the model above. Just as we loaded VGG16, we can load application_inception_v3
into the environment. Notice that this is trained on the ImageNet. Another pretrablined model is Xception
(Extreme Inception).
9.1.4.2 Residual Connection
Basically this is a shortcut function between the layers. Meaning that output for one layer can not only be input for the next layer, but it can also be input for deeper layers.
In general, this deals with two problems:
- Vanishing Gradients: where previous seen information is ruled out, and
- Representational Bottlenecks: Recall that in a traditional network each layer is only processing what it is given from the previous layer. Meaning that if previous neurons do not activate, then they will not push on that piece of information to the next layer. hence we can potentially loose key information in an early layer, for instance if we do not have enough units to handle the information. With residual connections
Since a given layer can have more than one input, and thus also activations, then we generally assume that the input sizes are the same. Although if they are not one must reshape the input.
Notice that this approach can be found in the pretrained network Xception
There is an example of how the residual connection could be built. It does not appear to be able to run, hence it will not be included in this notebook.
It can look like the following:
One see three different models, where the one a top is the ResNet, where we see that there are a lot of layers and to avoid the vanishing gradient problem is solved with having the residuals connections where previous information is injected into the model.
We see that the layers are more of bricks now where they can be modeled with, although one must have in mind that some bricks fit better together than other.
9.1.5 Layer weight sharing
Imagine a scenario where you want to assess the similarity of two different sentences, then it is crucial that both sentences are going through the same processing. This is what layer weight sharing is dealing with.
As the example above states, is that we are able to make so-called siamese twins. The following is merely an example:
library(keras)
#Creating a single layer, we use LSTM in this example
<- layer_lstm(units = 32)
lstm
#Building the left branch
<- layer_input(shape = list(NULL, 128))
left_input <- left_input %>% lstm()
left_output
#Building the right branch
<- layer_input(shape = list(NULL, 128))
right_input <- right_input %>% lstm()
right_output
#Merge the left and the right branch
<- layer_concatenate(list(left_output, right_output))
merged
#Building a classifier ontop
<- merged %>%
predictions layer_dense(units = 1, activation = "sigmoid")
#Assembling and training the model
<- keras_model(list(left_input, right_input), predictions)
model %>% fit(
model list(left_input = left_data #We assume that we have input
right_input = right_data) #We assume that we have input
, ,targets)
9.1.6 Models as layers
Basically one is able to load models into the network. I think this is best shown with the following chunk:
Imagine the scenario, where you have two cameras with two parallel cameras, this model can perceive depth, which can be useful in some situations. This cameras will be observing the same objects and hence it is fair to run the input the same way. Hence we construct siamese vision model
library(keras)
#The base image processing
<- application_xception(weights = NULL,
xception_base include_top = FALSE)
#The left and right input
<- layer_input(shape = c(250, 250, 3))
left_input <- layer_input(shape = c(250, 250, 3))
right_input
#Calling the same vision model
= left_input %>% xception_base()
left_features <- right_input %>% xception_base()
right_features
#Merging the features
<- layer_concatenate(
merged_features list(left_features, right_features)
)
Notice that the input of the cameras is 250 by 250 and with a RGB color code.
9.2 Inspecting and monitoring deep-learning models using Keras callba- acks and TensorBoard
This section covers some features that help train the models more efficiently and also presents an interactive board that can be used to assess performance of the models.
9.2.1 Using callbacks to act on a model during training
Until now, we have just trained the model with the given epochs. There is in fact a more clever way to train the model, where we for instance is able to stop training when the validation performance is not improved.
The chapter looks into some (not all callback functions). These are examples:
- Model checkpointing - enables one to store the wheights of the model at given checkpoints, hence we can run with a given amount of epochs, and then retrieve the wheights that appear to be best.
- Early stopping - This enables to interrupt the model
- Dynamically adjusting the value of certain parameters during training - This could for instance be having a dynamic learning rate, so for instance in the beginning it is big, while it decreases when you get closer to a minima
- Logging training and validation metrics during training, or visualizing the representations learned by the model as they are updated - The is basically what we already see, like how far are we in an epoch, what is the estimated finish time
These are examples:
callback_model_checkpoint()
callback_early_stopping()
callback_learning_rate_scheduler()
callback_reduce_lr_on_plateau()
callback_csv_logger()
9.2.1.1 The model-checkpoint and early-stopping callbacks
This is an example with callback_early_stopping()
and callback_model_checkpoint()
library(keras)
#Create a list for the callback
<- list(
callbacks_list
#Interrupt when there is no more improvement
callback_early_stopping(
monitor = "acc", #We monitor accuracy
patience = 1 #Stops when acc does not improve over more than 1 epoch
),
#Save the current weights after every epoch
callback_model_checkpoint(
filepath = "my_model.h5", #File name and destination
monitor = "val_loss",
save_best_only = TRUE
)
)%>% compile(
model optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = c("acc") #Monitor accuracy
)
#Train the model
%>% fit(
model #Input
x, #Output
y, epochs = 10, #Epochs
batch_size = 32, #Batches
callbacks = callbacks_list, #Callback lists
validation_data = list(x_val, y_val) #We monitor accuracy so we need validation data
)
9.2.1.2 The reduce-learning-rate-on-plateau callback
Here we want to adjust the learning rate if it appears that we are on a plateau. We use callback_reduce_lr_on_plateau()
<- list(
callbacks_list #Adjust the learning rate
callback_reduce_lr_on_plateau(
monitor = "val_loss", #Monitors the validation loss
factor = 0.1, #Devides by 10 (0.1 is relative to 1) when activated
patience = 10 #
)
)
#Building the model
%>% fit(
model
x, y,epochs = 10,
batch_size = 32,
callbacks = callbacks_list, #Notice we insert the callback list
validation_data = list(x_val, y_val)
)
9.2.1.3 Writing your own callback functions
If you are to write your own callback function I refer to the section in the book.
9.2.2 Introduction to tensorBoard: the TensorFlow visualization framework
Here they are representing a dashboard that can be used to assess how the model is doing.
The example is shown with the following code where we use the IMDB case again, it contains the following:
- Building the model
- Creating a directory
- Initiating the board
Creating the model
#Listing 7.7. Text-classification model to use with TensorBoard
library(keras)
<- 2000 #We only want to look at 2.000 words
max_features <- 500 #We only want the first 500 words
max_len
#Download and unpack the data
<- dataset_imdb(num_words = max_features)
imdb c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
#Create the variables
<- pad_sequences(x_train, maxlen = max_len)
x_train = pad_sequences(x_test, maxlen = max_len)
x_test
#Building the model
<- keras_model_sequential() %>%
model layer_embedding(input_dim = max_features, output_dim = 128,
input_length = max_len, name = "embed") %>%
layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>%
layer_max_pooling_1d(pool_size = 5) %>%
layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 1)
summary(model)
%>% compile(
model optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = c("acc")
)
Initiating the directory
#Listing 7.8. Creating a directory for TensorBoard log files
<- "Data/3. Deep Learning/my_log_dir"
directory dir.create(directory)
#Listing 7.9. Training the model with a TensorBoard callback
#Launch the TensorBoard
tensorboard(directory)
= list(
callbacks callback_tensorboard(
log_dir = directory,
histogram_freq = 1, #We want a historygram of each epoch
embeddings_freq = 1, #We want embedding data for each epoch
)
)
<- model %>% fit(
history
x_train, y_train,epochs = 20,
batch_size = 128,
validation_split = 0.2,
callbacks = callbacks
)
Now we see that we are able to follow the progression on the board. We we can explore the model and its performance.
9.3 Getting the most of your models
So far, we have primarily interpreted how the models can build and what to be aware of. This section explores how to make models that are more than just fine.
9.3.1 Advanced architecture patterns
We have the following important architectures:
- Residuals connections which is already covered
- Batch Normalization
- Depthwise separable convolution
We are going to look at no. 2 and 3.
9.3.1.1 Batch normalization
In its essence, it helps the model to learn new data. So far we have done the following:
- Normalizing by demeaning and dividing by the standard deviation, although this assumes Gaussian distribution in the data (normal distribution).
Although we see that even though we have normalized data that is coming in, it does not mean that the data that is coming out of the layer is normalized, hence we can apply a normalization in between the steps.
One often see that these normalizations are put after convolutions and densely connected layers. Hence it could look like this:
layer_conv_2d(filters = 32, kernel_size = 3, activation = "relu") %>%
layer_batch_normalization()
layer_dense(units = 32, activation = "relu") %>%
layer_batch_normalization()
Besides just having normal data, this also makes it easier for the back propagonation to update the weights and stay on track, hence recall the landscape where we want to find the valley. When this is easier to navigate when batch normalizing, then we can increase the learning rate.
9.3.1.2 Depthwise separable convolution
We should always use this, instead of normal convolutions
Here there is an alternative to the layer_conv_2d
it is less cumbersome and appear to be working better with smaller datasets, in general the claim that it can outperform layer_conv_2d
. The layer is called layer_separable_conv_2d
and this layer is in fact able to be found in the Xception
pretrained model.
library(keras)
<- 64
height <- 64
width <- 3
channels <- 10
num_classes <- keras_model_sequential() %>%
model layer_separable_conv_2d(filters = 32, kernel_size = 3,
activation = "relu",
input_shape = c(height, width, channels)) %>%
layer_separable_conv_2d(filters = 64, kernel_size = 3,
activation = "relu") %>%
layer_max_pooling_2d(pool_size = 2) %>%
layer_separable_conv_2d(filters = 64, kernel_size = 3,
activation = "relu") %>%
layer_separable_conv_2d(filters = 128, kernel_size = 3,
activation = "relu") %>%
layer_max_pooling_2d(pool_size = 2) %>%
layer_separable_conv_2d(filters = 64, kernel_size = 3,
activation = "relu") %>%
layer_separable_conv_2d(filters = 128, kernel_size = 3,
activation = "relu") %>%
layer_global_average_pooling_2d() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = num_classes, activation = "softmax")
%>% compile(
model optimizer = "rmsprop",
loss = "categorical_crossentropy"
)
summary(model)
9.3.2 Hyperparameter optimization
Summary:
- Very much trial and error
- Often random choices can compete with intuition
- They expect an increase in automated hyperparameter optimization
9.3.3 Model ensembling
In general one see that competition winners make different models and thus compute predictions on different models and then aggregates upon these. Where we see that even super good single models are not able to outperform ensemble models.
Notice that this only works if all of the models are good and there is not e.g., one model which is significantly worse than the other models. Although one can correct for this with assigning uneven wheights for the models.
Recall the model bias explored in the first semester. When we are having an ensemble of other models and they do not have the same model bias, then we see that this will rule out model bias.