Pix2Pix GAN Architecture

Saif Gazali
4 min read · Aug 7, 2021

The conversion of a source image to a target image is known as image-to-image translation. It typically requires specialized models and custom loss functions for a given task or dataset. A Generative Adversarial Network includes a generator model, which produces new, plausible fake samples that appear to come from the distribution of existing samples, and a discriminator model, which classifies a given sample as real or fake. The generator's weights are updated based on how well its samples fool the discriminator, while the discriminator's weights are updated based on how well it separates real samples from fake ones.

A conditional GAN conditions the generation of images on some additional input. This lets it generate targeted images of a given type, unlike a standard GAN, which generates a random image from the domain. Pix2Pix is a conditional GAN in which the target image is generated conditioned on a given input image. The generator model takes an input image and produces a translated version of it. The discriminator model takes the input image together with a translated image (either the real target or the generator's output) and predicts whether that translation is real or fake.
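
As a rough illustration of this pairing, the sketch below uses placeholder functions (purely hypothetical stand-ins, not the actual models) to show what the generator and discriminator each receive and produce.

```python
# Conceptual sketch of the Pix2Pix data flow; generator/discriminator here are
# dummy stand-ins, not the real networks.
import numpy as np

def generator(source):
    # stand-in: the real generator is a U-Net; here we just return an array of the same shape
    return np.zeros_like(source)

def discriminator(source, maybe_real):
    # stand-in: the real discriminator scores the (source, translation) pair
    return float(np.mean(maybe_real))  # dummy "realness" score

source_image = np.random.rand(256, 256, 3)   # input image
target_image = np.random.rand(256, 256, 3)   # ground-truth translation

fake_image = generator(source_image)                      # translated version of the input
real_score = discriminator(source_image, target_image)    # pair the discriminator should call real
fake_score = discriminator(source_image, fake_image)      # pair the discriminator should call fake
```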

U-NET Generator Model

The generator and discriminator models use standard Convolution-BatchNormalization-ReLU blocks of layers to build deep convolutional neural networks; the configuration of each layer is given in the appendix of the paper. The U-Net architecture is used for the generator rather than the traditional encoder-decoder model, which takes an image as input, down-samples it over several layers to a bottleneck, and then up-samples it over several layers to produce the output image. The U-Net also down-samples the image and up-samples it again, but adds skip connections between encoder and decoder layers of the same size, allowing low-level information to be shared directly between input and output.

UNET Architecture
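
Below is a minimal sketch of such a U-Net generator, assuming TensorFlow/Keras. Only three encoder/decoder blocks are shown and the filter counts are illustrative; the full configuration is in the paper's appendix.

```python
# Simplified U-Net generator: Conv-BatchNorm-LeakyReLU encoder blocks that halve
# the resolution, and transpose-conv decoder blocks with skip connections.
from tensorflow.keras.layers import (Input, Conv2D, Conv2DTranspose, BatchNormalization,
                                     LeakyReLU, ReLU, Concatenate, Activation)
from tensorflow.keras.models import Model

inp = Input(shape=(256, 256, 3))

# Encoder: down-sample to a bottleneck
e1 = LeakyReLU(0.2)(Conv2D(64, 4, strides=2, padding='same')(inp))                          # 128x128
e2 = LeakyReLU(0.2)(BatchNormalization()(Conv2D(128, 4, strides=2, padding='same')(e1)))    # 64x64
e3 = LeakyReLU(0.2)(BatchNormalization()(Conv2D(256, 4, strides=2, padding='same')(e2)))    # 32x32

# Decoder: up-sample and concatenate the encoder feature map of the same size (skip connection)
d1 = ReLU()(BatchNormalization()(Conv2DTranspose(128, 4, strides=2, padding='same')(e3)))   # 64x64
d1 = Concatenate()([d1, e2])
d2 = ReLU()(BatchNormalization()(Conv2DTranspose(64, 4, strides=2, padding='same')(d1)))    # 128x128
d2 = Concatenate()([d2, e1])
out = Activation('tanh')(Conv2DTranspose(3, 4, strides=2, padding='same')(d2))              # 256x256

unet_sketch = Model(inp, out)
```

The skip connections are what let fine spatial detail from the encoder bypass the bottleneck and reach the decoder layers of matching resolution.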

Discriminator Model

The discriminator model takes a source image and a translated image and predicts the likelihood that the translation is the real target for that source rather than a generated one. Pix2Pix uses a PatchGAN rather than a conventional deep convolutional discriminator, so it classifies patches of the image as real or fake instead of the entire image. The discriminator is run convolutionally across the image: the network outputs a feature map of per-patch real/fake predictions, and these responses are averaged to give a single score. A 70x70 patch size was found to be effective across different image-to-image translation tasks.

PatchGAN Discriminator Model
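
A rough sketch of a 70x70 PatchGAN discriminator is shown below, assuming TensorFlow/Keras. The layer stack follows the C64-C128-C256-C512 pattern described in the paper, but details such as weight initialization are omitted.

```python
# PatchGAN discriminator: conditioned on the source image by channel-wise
# concatenation; outputs a grid of per-patch real/fake predictions.
from tensorflow.keras.layers import Input, Concatenate, Conv2D, BatchNormalization, LeakyReLU
from tensorflow.keras.models import Model

src = Input(shape=(256, 256, 3))   # source image
tgt = Input(shape=(256, 256, 3))   # real target or generated translation
x = Concatenate()([src, tgt])      # condition the discriminator on the source

x = LeakyReLU(0.2)(Conv2D(64, 4, strides=2, padding='same')(x))                           # C64
x = LeakyReLU(0.2)(BatchNormalization()(Conv2D(128, 4, strides=2, padding='same')(x)))    # C128
x = LeakyReLU(0.2)(BatchNormalization()(Conv2D(256, 4, strides=2, padding='same')(x)))    # C256
x = LeakyReLU(0.2)(BatchNormalization()(Conv2D(512, 4, strides=1, padding='same')(x)))    # C512
patch_out = Conv2D(1, 4, strides=1, padding='same', activation='sigmoid')(x)              # patch map

patchgan_sketch = Model([src, tgt], patch_out)
```

Each value in the output map judges one image patch; averaging the map gives the single real/fake score described above.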

Composite Adversarial and L1 loss

The discriminator model is trained in a standalone manner, as in a standard GAN, by minimizing its loss for classifying real and fake images. The generator model is trained with a composite loss function that combines the adversarial loss with an L1 loss between the generated translation and the expected target image. The paper also evaluated an L2 loss, but found that the resulting images were blurrier. Each term has its own role: the adversarial loss pushes the generator to produce images that are plausible in the target domain, while the L1 loss pushes the output to be a plausible translation of the source image.
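
A sketch of how this composite generator objective could be written is given below, assuming TensorFlow/Keras; the function and variable names are illustrative. The paper weights the L1 term by 100.

```python
# Composite generator loss: adversarial term + weighted L1 term.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
l1 = tf.keras.losses.MeanAbsoluteError()
LAMBDA = 100.0  # L1 weight used in the paper

def generator_loss(disc_fake_output, generated_image, target_image):
    # Adversarial term: the generator wants the discriminator to label its output as real.
    adv_loss = bce(tf.ones_like(disc_fake_output), disc_fake_output)
    # L1 term: keeps the translation close to the expected target image.
    l1_loss = l1(target_image, generated_image)
    return adv_loss + LAMBDA * l1_loss
```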

Loss Function Analysis

Experiments were performed using the L1 loss alone, the adversarial loss alone, and the combination of the two. Images generated with the L1 loss alone were blurry, while images generated with the adversarial loss alone introduced new artifacts. The images generated using the combination of both losses were the clearest.

Generator model analysis

The U-Net generator model was compared to a traditional encoder-decoder generator using both the L1 loss alone and the combined L1 + adversarial loss. The encoder-decoder model was unable to generate realistic images, whereas the U-Net generator produced much sharper results. The advantage was not specific to the loss function: when both models were trained with the L1 loss alone, the U-Net again performed better.

Discriminator model analysis

PatchGAN discriminator models with different receptive field sizes were tested: 1 x 1, 16 x 16, 70 x 70, and the full 286 x 286 ImageGAN. The larger the receptive field, the deeper the network. Although the full ImageGAN can produce crisp images, it is deeper, has more parameters, and is harder to train; the 70 x 70 receptive field provided a good balance between model depth and image quality.
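
As a quick check of how patch size relates to network depth, the small helper below computes the receptive field of a stack of 4 x 4 convolutions; the strides shown match the 70 x 70 PatchGAN configuration sketched earlier.

```python
# Receptive-field calculation for a stack of conv layers, from input onward.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from the input."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input-space steps
        jump *= s              # stride compounds the step size of later layers
    return rf

# C64-C128-C256 with stride 2, then C512 and the output conv with stride 1:
print(receptive_field([(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]))  # -> 70
```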

Resources

Image-to-Image Translation with Conditional Adversarial Networks, 2016

Machine Learning Mastery
