
Project Background

Summary

The goal of our project is to implement an algorithm that takes the style of an artwork and applies it to a photograph. To do this, we have to extract the content and the style from each image separately. Advances in Convolutional Neural Networks (CNNs) make it possible to extract this information from images. After writing our own code and analysing others' results, we developed a solid understanding of the concepts and the underlying mathematics, and we obtained the image representations derived from a Convolutional Neural Network.

Introduction to Convolutional Neural Networks

CNNs are made up of neurons, each of which is a computational block with learnable weights and biases. Every block receives some inputs and performs a dot product. Each layer extracts a certain feature from the input image, so the output of a layer consists of differently filtered versions of the input image.

Structure of CNN

There are many types of layers in a neural network. A typical CNN architecture includes a Convolutional Layer, a Pooling Layer, and a Fully-Connected Layer. The Convolutional Layer computes the output of its neurons, each computing a dot product between its weights and the small region of the input it is connected to (Gatys). The transformations performed by convolutional layers therefore depend not only on the input values but also on the weights and biases of the neurons. The Convolutional Layer's parameters consist of a set of learnable filters. Each filter covers a small region along the width and height of the input and extends through its full depth. The filter slides over the height and width of the input, and a dot product is calculated between the filter and the input at every position (Gatys). Each filter produces a separate response map.
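
This sliding dot product can be sketched in a few lines of PyTorch (the input size, filter count, and filter size below are illustrative, not the project's settings):

import torch
import torch.nn.functional as F

# A 3-channel 32x32 input and 8 learnable 3x3 filters, each extending
# through the full depth (3 channels) of the input.
image = torch.randn(1, 3, 32, 32)      # (batch, depth, height, width)
filters = torch.randn(8, 3, 3, 3)
bias = torch.randn(8)

# conv2d slides each filter over the input; at every position the output
# is the dot product between the filter weights and the local region.
response = F.conv2d(image, filters, bias, stride=1, padding=1)
print(response.shape)                  # torch.Size([1, 8, 32, 32]) -- one response map per filter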

The Pooling Layer performs a downsampling operation along the width and height dimensions (Gatys). In our project, we used average-pooling. Its function is to progressively reduce the spatial size of the representation and thereby reduce the computation in the network (Gatys). While the number of distinct filters increases with depth, the size of the filtered images is reduced, leading to a decrease in the total number of units per layer.
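
As a small illustration of the downsampling (the kernel size and input shape are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
pool = nn.AvgPool2d(kernel_size=2, stride=2)   # average-pooling, as used in our project
print(pool(x).shape)                           # torch.Size([1, 64, 16, 16]) -- width and height halved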

The Fully-Connected Layer computes the class scores. Neurons in this layer have full connections to all activations in the previous layer. The only difference between the Fully-Connected Layer and the Convolutional Layer is the connectivity pattern (Gatys).

In this way, CNNs transform the original image layer by layer: an input image is encoded in each layer of the CNN by the filter responses to that image. During this process, the trained CNN develops a representation of the image that makes the object information increasingly explicit.

Gradient Descent

The parameters in the convolutional layers are trained with gradient descent. In practice, during the backward pass of the convolution operation, every neuron computes the gradient for its weights. These gradients are then summed across each depth slice, and the single shared set of weights for that slice is updated (Gatys).
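
A minimal sketch of one such update on a filter's shared weights (the loss and learning rate here are illustrative):

import torch
import torch.nn.functional as F

weights = torch.randn(8, 3, 3, 3, requires_grad=True)   # one shared weight set per depth slice
image = torch.randn(1, 3, 32, 32)

response = F.conv2d(image, weights)
loss = response.pow(2).mean()      # an illustrative loss
loss.backward()                    # the backward pass sums each filter's gradient
                                   # over every position the filter visited
with torch.no_grad():
    weights -= 0.01 * weights.grad     # one gradient-descent step on the shared weights
    weights.grad.zero_()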

Advantages

The CNN architecture is built on the assumption that the inputs are images. Compared to regular Neural Nets, which consist of fully-connected neurons, the locally connected neurons of CNNs are more efficient. Fully-connected means every neuron is connected to every neuron in the previous layer, whereas in a CNN each neuron is connected to only a local region of the input. Local connectivity makes the forward pass more efficient and reduces the number of parameters in the network (Gatys).
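
A quick parameter count makes the saving concrete (the sizes are illustrative):

# One fully-connected neuron looking at a 224x224x3 input needs a weight
# for every input value; a convolutional neuron needs only a 3x3 filter
# through the depth, and that filter is shared across all positions.
fully_connected = 224 * 224 * 3    # 150,528 weights per neuron
convolutional = 3 * 3 * 3          # 27 weights per filter
print(fully_connected, convolutional)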

Introduction to the Algorithm

Summary

This algorithm uses a CNN (described above) to represent and combine the style of an artwork and the content of a photo. The type of neural network the algorithm uses is a modified version of a VGG-19 network. The details of this algorithm are covered in the sections below.
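
A sketch of how such a modified VGG-19 can be set up with torchvision (our project's exact layer modifications may differ; torchvision >= 0.13 is assumed):

import torch.nn as nn
from torchvision import models

# Load a pre-trained VGG-19 and keep only its convolutional feature stack.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

# Replace max-pooling with average-pooling, the modification used in our project.
for i, layer in enumerate(vgg):
    if isinstance(layer, nn.MaxPool2d):
        vgg[i] = nn.AvgPool2d(kernel_size=2, stride=2)

# Freeze the network: only the generated image will be optimised, never the weights.
for p in vgg.parameters():
    p.requires_grad_(False)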

Content Representation

Deeper in the network, the information about the objects in the image becomes more explicit: higher layers in the CNN capture the high-level content of the image rather than its exact pixel values. The input image is thus transformed into representations that are increasingly related to the actual content of the image, and the filter responses in the higher layers of the CNN serve as the content representation. To reconstruct the content, we can visualise this information by performing gradient descent on a white noise image. Let p and x be the original image and the generated image, and let P^l and F^l be their respective feature representations in layer l, where F^l_ij is the activation of the i-th filter at position j in layer l. The squared-error loss between the two representations is defined as below (Gatys).

\mathcal{L}_{\mathrm{content}}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}
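
This loss is straightforward to express in code (a sketch; F_l and P_l are the two activation tensors from the same layer):

def content_loss(F_l, P_l):
    # Squared-error between the generated image's activations F_l and the
    # content image's activations P_l in one layer.
    return 0.5 * (F_l - P_l).pow(2).sum()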

Then the derivative of the loss with respect to the activations in layer l is calculated.

\frac{\partial \mathcal{L}_{\mathrm{content}}}{\partial F^{l}_{ij}} =
\begin{cases}
\left( F^{l} - P^{l} \right)_{ij} & \text{if } F^{l}_{ij} > 0 \\
0 & \text{if } F^{l}_{ij} < 0
\end{cases}

From this derivative, the gradient with respect to the image x can be computed using standard error back-propagation. The initially random image x is therefore changed iteratively until it generates the same response in a certain layer of the CNN as the original image (Gatys).
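
Each iteration is a plain gradient-descent step on the image itself, with some step size \eta (in practice an optimiser such as L-BFGS or Adam performs this update):

\vec{x} \leftarrow \vec{x} - \eta \, \frac{\partial \mathcal{L}_{\mathrm{content}}}{\partial \vec{x}}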

Style Representation

To obtain a representation of the style, we use a feature space designed to capture texture information (Gatys). It consists of the correlations between the different filter responses, collected in a Gram matrix. The Gram matrix G^l in layer l, whose entry G^l_ij is the inner product between the responses of filters i and j, is defined below.

G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}
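
In code, the Gram matrix of one layer's activations can be computed as follows (a sketch; the activation tensor is assumed to have shape (N_l, H, W)):

import torch

def gram_matrix(features):
    # Flatten each of the N_l response maps to length M_l = H*W, then take
    # all pairwise inner products between the filter responses.
    n, h, w = features.shape
    F_l = features.reshape(n, h * w)   # N_l x M_l
    return F_l @ F_l.t()               # N_l x N_l matrix of correlations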

To reconstruct the style, we again perform gradient descent from a white noise image, this time minimising the mean squared distance between the Gram matrices of the original image and the generated image (Gatys). For a layer l with N_l distinct filters, each with a response of size M_l, let a and x be the original image and the generated image, and let A^l and G^l be their respective style representations in layer l (Gatys). The contribution of layer l to the total loss is defined as below.

E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}
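
Building on the gram_matrix helper above, the per-layer style loss is a short function (again a sketch):

def style_layer_loss(features_x, features_a):
    # E_l: mean-squared distance between the Gram matrices of the generated
    # image x and the style image a in one layer.
    n, h, w = features_x.shape
    m = h * w
    G = gram_matrix(features_x)
    A = gram_matrix(features_a)
    return (G - A).pow(2).sum() / (4 * n**2 * m**2)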

The total style loss is defined below, where w_l is the weighting factor of each layer's contribution to the total loss.

\mathcal{L}_{\mathrm{style}}(\vec{a}, \vec{x}) = \sum_{l} w_{l} E_{l}

The derivative of E_l with respect to the activations in layer l can be computed analytically (Gatys):

\frac{\partial E_{l}}{\partial F^{l}_{ij}} =
\begin{cases}
\frac{1}{N_{l}^{2} M_{l}^{2}} \left( (F^{l})^{\mathrm{T}} (G^{l} - A^{l}) \right)_{ji} & \text{if } F^{l}_{ij} > 0 \\
0 & \text{if } F^{l}_{ij} < 0
\end{cases}

The visual results of the style and content representations and reconstructions are shown below.

Combining Style and Content

Once we have the representations of the style and the content, we can transfer the style of the artwork onto the photograph. The task is to jointly minimise the distance between the content (feature) representation of the photograph and the style representation of the painting (Gatys). The loss function we minimise is

\mathcal{L}_{\mathrm{total}}(\vec{p}, \vec{a}, \vec{x}) = \alpha \, \mathcal{L}_{\mathrm{content}}(\vec{p}, \vec{x}) + \beta \, \mathcal{L}_{\mathrm{style}}(\vec{a}, \vec{x})

Here α and β are the weighting factors for the content and style reconstruction, respectively. The procedure for combining style and content is as follows. First, the content features and style features are extracted by passing the content image and the style image through the network. Then a random white noise image x is passed through the network, and the total loss and its gradient are computed. The gradient of the loss is used to update the image x iteratively until it simultaneously matches the style features of the artwork and the content features of the photograph.
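
Putting the pieces together, the whole procedure can be condensed into a short optimisation loop (a sketch built from the helpers above; the layer indices, weights alpha and beta, optimiser, and iteration count are illustrative, not the project's exact settings):

import torch

# Assumed from the sketches above: vgg (the frozen feature stack), content_loss,
# gram_matrix, style_layer_loss, and image tensors content_img and style_img
# of shape (1, 3, H, W).
content_layers = {21}                  # illustrative: one deep layer for content
style_layers = {0, 5, 10, 19, 28}      # illustrative: several layers for style

def extract(img, wanted):
    feats, out = {}, img
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in wanted:
            feats[i] = out
    return feats

P = extract(content_img, content_layers)    # content targets
A = extract(style_img, style_layers)        # style targets

x = torch.randn_like(content_img, requires_grad=True)   # white noise image
optimizer = torch.optim.Adam([x], lr=0.05)
alpha, beta = 1.0, 1000.0                                # illustrative weighting

for step in range(500):
    optimizer.zero_grad()
    feats = extract(x, content_layers | style_layers)
    L_c = sum(content_loss(feats[l], P[l]) for l in content_layers)
    L_s = sum(style_layer_loss(feats[l][0], A[l][0]) for l in style_layers)
    loss = alpha * L_c + beta * L_s
    loss.backward()      # gradient with respect to the image, not the weights
    optimizer.step()     # update x until it matches both targets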
