CS180-Proj5: Fun with Diffusion Models

David Wei
david_wei@berkeley.edu

1. Introduction

Project 5 consists of two parts: The Power of Diffusion Models (Part A) and Diffusion Models from Scratch (Part B).

In Part A, I mainly experimented with a pretrained diffusion model called DeepFloyd IF. First, I used the model for denoising: I corrupted a sample image with random noise and used the model to predict that noise. I also denoised the image with a Gaussian blur and compared the results. Then, I denoised a pure-noise image to obtain a computer-generated image, adopting the Classifier-Free Guidance (CFG) technique. Later, I performed image-to-image translation, where images are translated into similar images guided either by masks or by text prompts. Finally, I produced Visual Anagrams, Hybrid Images, and a course logo.

In Part B, I built and trained a diffusion model from scratch. First, I trained a UNet to denoise half-noisy MNIST images (original image + 50% pure noise). Then, to denoise images with different amounts of noise, I added Time Conditioning to the UNet, so the UNet is told how noisy each image is. The trained UNet can accurately predict the noise that was added to the images. Using this trained denoiser, I generated MNIST-like images by denoising pure noise over 300 steps, only to find that the generated images look little like human-written digits. To improve the results, I added Class Conditioning to the UNet, so the UNet is told not only how noisy each image is but also its label (0 to 9); 10% of the images are given no label. I applied Classifier-Free Guidance during sampling to generate MNIST-like images. The results are much better than the previous attempt.

A1. Denoising using a Diffusion Model

In Part A, I worked with the DeepFloyd IF diffusion model. DeepFloyd IF is a text-to-image model trained by Stability AI: it takes text prompts as input and outputs images aligned with the text. Figure 1 shows three images I generated with DeepFloyd IF.

Figure 1a
Figure 1a: An oil painting of a snowy mountain.
Figure 1b
Figure 1b: A man wearing a hat.
Figure 1c
Figure 1c: A rocket ship.

Diffusion models generate images by predicting the noise in an image and "denoising" it by subtracting that noise. The resulting "clean" image is often a different, computer-generated image. To test this, I added different amounts of random noise to the Sather Tower sample image and tried to restore it from the noisy image and the model's noise prediction. Figure 2 shows the results.
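Here is a minimal PyTorch-style sketch of the two operations involved, assuming a precomputed cumulative noise schedule alphas_cumprod indexed by timestep (the names and layout are my own for illustration, not DeepFloyd's API):

```python
import torch

def forward_noise(x0, t, alphas_cumprod):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
    with eps drawn from a standard normal distribution."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return xt, eps

def one_step_denoise(xt, t, eps_pred, alphas_cumprod):
    """Invert the forward process using the model's noise prediction."""
    abar = alphas_cumprod[t]
    return (xt - (1 - abar).sqrt() * eps_pred) / abar.sqrt()
```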

Add 25% noise:
Figure 2a
Original image.
Figure 2b
25% noisy image.
Figure 2c
Restored image.
Add 50% noise:
Figure 2d
Original image.
Figure 2e
50% noisy image.
Figure 2f
Restored image.
Add 75% noise:
Figure 2g
Original image.
Figure 2h
75% noisy image.
Figure 2i
Restored image.

Denoising becomes harder as more noise is added. I used Iterative Denoising to produce better results; a sketch of one denoising step follows Figure 3. Figure 3 shows the restored image using Iterative Denoising, One-Step Denoising, and Gaussian Blur, respectively.

Figure 3a
Original image.
Figure 3b
Noisy image.
Figure 3c
Iterative Denoising.
Figure 3d
One-Step Denoising.
Figure 3e
Gaussian Blur.
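As promised above, here is a sketch of one iterative-denoising step. Each update interpolates between the current noisy image and the current clean-image estimate, using the same assumed alphas_cumprod schedule as before (DeepFloyd additionally predicts a variance term, which this sketch omits):

```python
def iterative_denoise_step(xt, t, t_prev, eps_pred, alphas_cumprod):
    """One update from timestep t to the less-noisy timestep t_prev
    on a strided schedule (t_prev < t)."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = abar_t / abar_prev            # effective per-step alpha
    beta = 1 - alpha
    # current estimate of the clean image
    x0_est = (xt - (1 - abar_t).sqrt() * eps_pred) / abar_t.sqrt()
    # blend the clean estimate with the current noisy image
    return (abar_prev.sqrt() * beta / (1 - abar_t)) * x0_est \
         + (alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)) * xt
```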

A2. Generating Random Images using a Diffusion Model

When denoising a 100%-noisy image (pure noise) with the diffusion model, we get random "denoised" images. These images are unique and do not come from the natural world, so we can call them "computer-generated" images. Figure 4 shows some of these randomly generated images.

Figure 4a
Figure 4b
Figure 4c
Figure 4d
Figure 4e
Figure 4f

Most of these images are not good. To improve their quality, I used a technique called Classifier-Free Guidance (CFG), which runs the model twice per step, once with the text condition and once with a null condition, and extrapolates past the unconditional estimate. A minimal sketch of the guidance step is shown below, and Figure 5 shows the images generated using CFG.
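In this sketch the model(x, t, emb) signature and the scale value are assumptions for illustration:

```python
def cfg_noise_estimate(model, xt, t, cond_emb, uncond_emb, scale=7.0):
    """eps = eps_uncond + scale * (eps_cond - eps_uncond).
    A scale > 1 pushes the sample toward the conditioned distribution."""
    eps_cond = model(xt, t, cond_emb)
    eps_uncond = model(xt, t, uncond_emb)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```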

Figure 5a
Figure 5b
Figure 5c
Figure 5d
Figure 5e
Figure 5f

A3. Image-to-Image Translation

By adding a certain amount of noise to an image and then denoising it with the diffusion model, we can produce similar images with different details. These details are generated by the model: the more noise added to the image, the more "computer creativity" there is in the restored image. Here are some examples.
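This is the SDEdit idea: noise the original image to an intermediate timestep, then run the usual denoising loop from there. A sketch reusing the helpers above (the strided schedule list and the model signature are assumptions):

```python
def image_to_image(x_orig, i_start, schedule, model, alphas_cumprod):
    """Noise x_orig to schedule[i_start], then denoise back to t = 0.
    Starting at a noisier timestep gives the model more freedom."""
    xt, _ = forward_noise(x_orig, schedule[i_start], alphas_cumprod)
    for i in range(i_start, len(schedule) - 1):
        t, t_prev = schedule[i], schedule[i + 1]
        eps = model(xt, t)  # noise prediction (CFG could be applied here)
        xt = iterative_denoise_step(xt, t, t_prev, eps, alphas_cumprod)
    return xt
```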

Figure 6a
Add 96% noise then denoise
Figure 6b
Add 90% noise then denoise
Figure 6c
Add 84% noise then denoise
Figure 6d
Add 78% noise then denoise
Figure 6e
Add 69% noise then denoise
Figure 6f
Add 39% noise then denoise
Figure 6g
Original image
Figure 6h
Add 96% noise then denoise
Figure 6i
Add 90% noise then denoise
Figure 6j
Add 84% noise then denoise
Figure 6k
Add 78% noise then denoise
Figure 6l
Add 69% noise then denoise
Figure 6m
Add 39% noise then denoise
Figure 6n
Original image
Figure 6o
Add 96% noise then denoise
Figure 6p
Add 90% noise then denoise
Figure 6q
Add 84% noise then denoise
Figure 6r
Add 78% noise then denoise
Figure 6s
Add 69% noise then denoise
Figure 6t
Add 39% noise then denoise
Figure 6u
Original image

The same can be done on a web image and on two of my hand-drawn images (which are supposed to be a portrait of myself and the Apple logo; to be fair, I'm not a good artist, and sometimes I don't recognize my own drawings).

Figure 7a
Add 96% noise then denoise
Figure 7b
Add 90% noise then denoise
Figure 7c
Add 84% noise then denoise
Figure 7d
Add 78% noise then denoise
Figure 7e
Add 69% noise then denoise
Figure 7f
Add 39% noise then denoise
Figure 7g
Original image
Figure 7h
Add 96% noise then denoise
Figure 7i
Add 90% noise then denoise
Figure 7j
Add 84% noise then denoise
Figure 7k
Add 78% noise then denoise
Figure 7l
Add 69% noise then denoise
Figure 7m
Add 39% noise then denoise
Figure 7n
Original image
Figure 7o
Add 96% noise then denoise
Figure 7p
Add 90% noise then denoise
Figure 7q
Add 84% noise then denoise
Figure 7r
Add 78% noise then denoise
Figure 7s
Add 69% noise then denoise
Figure 7t
Add 39% noise then denoise
Figure 7u
Original image

I also produced inpaintings, where a certain part of the image is preserved but the rest is generated by the diffusion model. A sketch of the masking step is shown below, followed by the results.
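Inpainting only needs one extra operation inside the denoising loop: after every update, the pixels outside the mask are forced back to an appropriately noised copy of the original image. A sketch, with mask equal to 1 where new content should be generated:

```python
def inpaint_constrain(xt, x_orig, mask, t, alphas_cumprod):
    """Keep generated content inside the mask; everywhere else, replace
    x_t with the original image noised to the same timestep t."""
    x_orig_t, _ = forward_noise(x_orig, t, alphas_cumprod)
    return mask * xt + (1 - mask) * x_orig_t
```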

Figure 8a
Original image
Figure 8b
Mask
Figure 8c
To replace
Figure 8d
Inpainting
Figure 8e
Original image
Figure 8f
Mask
Figure 8g
To replace
Figure 8h
Inpainting
Figure 8i
Original image
Figure 8j
Mask
Figure 8k
To replace
Figure 8l
Inpainting

The following shows my attempts at Text-Conditioned Image-to-Image Translation, where a noisy image (e.g. a tower) is denoised in the direction of a text prompt (e.g. a rocket). Here are the results.

text: a rocket ship
Figure 9a
Add 96% noise then denoise
Figure 9b
Add 90% noise then denoise
Figure 9c
Add 84% noise then denoise
Figure 9d
Add 78% noise then denoise
Figure 9e
Add 69% noise then denoise
Figure 9f
Add 39% noise then denoise
Figure 9g
Original image
text: a photo of a man
Figure 9h
Add 96% noise then denoise
Figure 9i
Add 90% noise then denoise
Figure 9j
Add 84% noise then denoise
Figure 9k
Add 78% noise then denoise
Figure 9l
Add 69% noise then denoise
Figure 9m
Add 39% noise then denoise
Figure 9n
Original image
text: a lithograph of waterfalls
Figure 9o
Add 96% noise then denoise
Figure 9p
Add 90% noise then denoise
Figure 9q
Add 84% noise then denoise
Figure 9r
Add 78% noise then denoise
Figure 9s
Add 69% noise then denoise
Figure 9t
Add 39% noise then denoise
Figure 9u
Original image

A4. Visual Anagrams and Hybrid Images

I generated Visual Anagrams, where the object in the image becomes a different object when flipped upside-down. A sketch of the flipped-noise averaging step is shown below, followed by the images.
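The trick is to average two noise estimates at every denoising step: one for the first prompt on the upright image, and one for the second prompt on the vertically flipped image, flipped back before averaging. A minimal sketch (the model signature is assumed as before):

```python
import torch

def anagram_noise(model, xt, t, emb_up, emb_down):
    """Average the upright and flipped noise estimates so the sample
    matches one prompt right-side up and the other upside-down."""
    eps_up = model(xt, t, emb_up)
    eps_down = model(torch.flip(xt, dims=[-2]), t, emb_down)
    return (eps_up + torch.flip(eps_down, dims=[-2])) / 2
```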

Unflipped:
Figure 10a
A photo of a dog.
Figure 10b
An oil painting of people around a campfire.
Figure 10c
A lithograph of waterfalls.
Flipped:
Figure 10d
A photo of a man.
Figure 10e
An oil painting of old man under a light bulb.
Figure 10f
A lithograph of a skull.

I also generated Hybrid Images, where an image appears as one object up close and a different object from afar. A sketch of the frequency-splitting step is shown below, followed by the images.
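Here the two noise estimates are combined in frequency space instead: the low frequencies come from the prompt meant to be seen from afar, and the high frequencies from the prompt meant to be seen up close. A sketch using torchvision's Gaussian blur as the low-pass filter (the kernel size and sigma are my illustrative choices, not prescribed values):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(model, xt, t, emb_far, emb_near, ksize=33, sigma=2.0):
    """Low-pass one noise estimate, high-pass the other, and sum them."""
    eps_far = model(xt, t, emb_far)
    eps_near = model(xt, t, emb_near)
    low = TF.gaussian_blur(eps_far, ksize, sigma)
    high = eps_near - TF.gaussian_blur(eps_near, ksize, sigma)
    return low + high
```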

Figure 11a
Waterfall + Skull
Figure 11b
Snow mountain villege + Campfire
Figure 11c
Man + Dog

Besides the successful Waterfall + Skull hybrid, the other hybrid images seem to simply stack the elements of the two objects: the Snow mountain village + Campfire hybrid just portrays people having a campfire on a snowy mountain.

A5. Bells and Whistles: Class Logo

Among all the class logos I generated using the diffusion model, the following two are my favorites, as they showcase both 'computer' (a robot) and 'vision' (a camera).

Figure 12a
Figure 12b

B1. UNet implementation

I implemented a UNet denoiser to denoise MNIST images at different noise levels. Noisy images are generated by adding pure noise, drawn from a standard normal distribution, to the original image. Different levels of noisy MNIST images are shown here, noised at 0%, 20%, 40%, 50%, 60%, 80%, and 100%, respectively.
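Note that the noising operation in this part is plain additive noise rather than the scaled schedule from Part A. A minimal sketch, where sigma is the noise level:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)

# the noise levels shown in Figure 13
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
```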

Figure 13

The UNet denoiser is implemented with the following structure.

Figure 14

I trained the UNet on 50%-noisy MNIST images (original image + 50% pure noise), using the Mean Squared Error (MSE) between the UNet's predicted image and the real clean image as the loss function. The UNet denoises the noisy images effectively. A sketch of one training step is shown below, followed by the training loss curve and the denoised images.
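A minimal sketch of the training step, assuming a unet module that maps a noisy image directly to a clean-image prediction:

```python
import torch
import torch.nn.functional as F

def train_step(unet, x, optimizer, sigma=0.5):
    """One step: noise a clean MNIST batch at sigma = 0.5, predict the
    clean image, and minimize the MSE against the original."""
    z = x + sigma * torch.randn_like(x)
    loss = F.mse_loss(unet(z), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```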

Figure 15
Figure 16

Even though the UNet was trained to denoise 50%-noisy images, it denoises images at other noise levels reasonably well. Here are the noised images (first row; noised at 0%, 20%, 40%, 50%, 60%, 80%, and 100%, respectively) and the images denoised by the UNet (second row).

Figure 17

B2. Time-Conditioned UNet

To improve the UNet's ability to denoise images at different noise levels, I condition it on the timestep t, which tells the UNet how noisy the image is; the UNet then predicts the noise that was added to the image. The following shows the training algorithm, followed by a sketch of the loop.

Figure 18
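A minimal PyTorch-style sketch of this training loop; I assume the timestep is normalized to [0, 1] before being fed to the UNet, and that alphas_cumprod is the precomputed cumulative product of the DDPM schedule:

```python
import torch
import torch.nn.functional as F

def train_step_tc(unet, x0, optimizer, alphas_cumprod, T=300):
    """Sample a random timestep per image, noise accordingly, and train
    the UNet to predict the added noise (not the clean image)."""
    t = torch.randint(1, T + 1, (x0.shape[0],))
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    loss = F.mse_loss(unet(xt, t.float() / T), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```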

The UNet was trained for 20 epochs, using the Adam optimizer with an initial learning rate of 1e-3 for gradient descent. Each hidden layer in the UNet had 128 neurons, and each training step used a batch of 128 noisy MNIST images. Here is the training loss curve.

Figure 19

Ideally, when fed a pure-noise image to denoise, the UNet generates a "human-written" digit image based on its own understanding. This process is called sampling, and is defined as follows.

Figure 20
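A sketch of this sampling loop under the same assumptions as the training sketch (schedules indexed 1..T, with no noise added at the final step):

```python
import torch

@torch.no_grad()
def sample(unet, betas, alphas, alphas_cumprod, T=300, shape=(16, 1, 28, 28)):
    """DDPM sampling: start from pure noise and denoise for T steps."""
    x = torch.randn(shape)
    for t in range(T, 0, -1):
        eps = unet(x, torch.full((shape[0],), t / T))
        abar, abar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        # current clean-image estimate from the predicted noise
        x0_est = (x - (1 - abar).sqrt() * eps) / abar.sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = (abar_prev.sqrt() * betas[t] / (1 - abar)) * x0_est \
          + (alphas[t].sqrt() * (1 - abar_prev) / (1 - abar)) * x \
          + betas[t].sqrt() * z
    return x
```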

I post-processed the sampled images by setting all pixels below a certain threshold to black, further suppressing minor noise. The computer-generated MNIST images after 5 and 20 training epochs are shown here.

Figure 21 Computer-Generated MNIST images using Time-Conditioned UNet after 5 training epochs.
Figure 22 Computer-Generated MNIST images using Time-Conditioned UNet after 20 training epochs.

It's clear that the generated images are not satisfying, as many of them are hardly readable.

B3. Class-Conditioned UNet

The unsatisfying results from the Time-Conditioned UNet were probably due to the lack of image labels: the model didn't know which digit it should generate, so it produced a blend of all digits. To fix this, I added Class Conditioning to the UNet. During training, the UNet is told which digit each noisy image represents, and during sampling, we tell the UNet which digit to generate. Classifier-Free Guidance is applied in the sampling process. Algorithms 3 and 4 show the training and sampling processes for the Class-Conditioned UNet, respectively.

Figure 23a
Figure 23b
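In code, the two new pieces are small. During training, the one-hot labels are zeroed with probability 0.1 so the model also learns the unconditional distribution; at sampling time, the conditional and unconditional noise estimates are mixed via CFG. A sketch, with the unet signature and guidance scale assumed for illustration:

```python
import torch

def drop_labels(c_onehot, p_uncond=0.1):
    """Zero out each label with probability p_uncond during training."""
    mask = (torch.rand(c_onehot.shape[0]) < p_uncond).view(-1, 1)
    return c_onehot * (~mask).float()

def cfg_eps(unet, x, t, c_onehot, guidance=5.0):
    """CFG at sampling time: mix class-conditioned and unconditional
    (all-zero one-hot) noise estimates."""
    eps_cond = unet(x, t, c_onehot)
    eps_uncond = unet(x, t, torch.zeros_like(c_onehot))
    return eps_uncond + guidance * (eps_cond - eps_uncond)
```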

The hyperparameters I used for the class-conditioned training are identical to those of the Time-Conditioned UNet. Here is the training loss curve.

Figure 24

Here are the results after 5 and 20 training epochs, respectively.

Figure 25 Computer-Generated MNIST images using Class-Conditioned UNet after 5 training epochs.
Figure 26 Computer-Generated MNIST images using Class-Conditioned UNet after 20 training epochs.

Given enough training, the results improved dramatically compared to the time-conditioned attempt.

* Part A finished on Nov 6, 2024; Part B finished on Nov 16, 2024.