Pix2pixHD: High-Resolution Image Translation

Understand the pix2pixHD architecture and train the model on the Cityscapes dataset.

Pix2pixHD is an upgraded version of the pix2pix model (Wang, Ting-Chun, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. "High-resolution image synthesis and semantic manipulation with conditional GANs." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798-8807. 2018). The biggest improvement of pix2pixHD over pix2pix is that it supports high-quality image-to-image translation at $2048 \times 1024$ resolution.

Model architecture

To make this happen, the authors designed a two-stage approach that gradually trains and refines the networks, as shown in the following diagram. First, a lower-resolution image of $1024 \times 512$ is generated by a generator network $G_1$, called the global generator (the red box). Second, the image is enlarged by a generator network $G_2$, called the local enhancer network, so that it's around $2048 \times 1024$ in size (the opaque box). It is also viable to put another local enhancer network at the end to generate $4096 \times 2048$ images. Note that the last feature map of $G_1$ is also inserted into $G_2$ (before the residual blocks) via an element-wise sum to introduce more global information into higher-resolution images.
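To make the coarse-to-fine resolutions concrete, here is a minimal sketch in plain Python that only does the size arithmetic; it does not build any network. The helper names `generator_resolutions` and `enhancer_fusion_size` are hypothetical and introduced for illustration. The sketch also assumes that each local enhancer halves its input once before the element-wise sum, which is why the feature map coming from the coarser generator can be added directly.

```python
def generator_resolutions(base_width, base_height, num_enhancers):
    """Return the (width, height) each generator outputs, coarse to fine.

    Index 0 is the global generator G1. Every later entry is a local
    enhancer that doubles both sides of the previous output.
    """
    if num_enhancers < 0:
        raise ValueError("num_enhancers must be non-negative")
    return [
        (base_width * 2 ** i, base_height * 2 ** i)
        for i in range(num_enhancers + 1)
    ]


def enhancer_fusion_size(enhancer_width, enhancer_height):
    """Spatial size at which a local enhancer adds the coarser feature map.

    Assumption (not stated in the text above): the enhancer halves each
    side of its input once before the element-wise sum.
    """
    return (enhancer_width // 2, enhancer_height // 2)


# G1 plus two local enhancers (the second enhancer is the optional extension).
stages = generator_resolutions(1024, 512, num_enhancers=2)
for index, (width, height) in enumerate(stages, start=1):
    print(f"G{index}: {width}x{height}")

# An element-wise sum needs feature maps of equal spatial size, so the
# halved input of G2 must match the output resolution of G1.
assert enhancer_fusion_size(*stages[1]) == stages[0]
```

Each added enhancer doubles both sides of the image, so the pixel count grows by a factor of four per stage. This is one reason to train the cheaper global generator first and refine afterwards.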
