Project 5 consists of two parts: The Power of Diffusion Models (Part A) and Diffusion Models from Scratch (Part B).
In Part A, I mainly experimented with a pretrained diffusion model called DeepFloyd IF. First, I used the model for denoising: I corrupted a sample image with random noise and used the model to predict that noise. I also denoised the image with a Gaussian blur and compared the results. Then, I denoised a pure-noise image to obtain a computer-generated image, applying the Classifier-Free Guidance technique. Later, I performed image-to-image translation, where images are transformed into similar images guided either by masks or by text prompts. Finally, I produced Visual Anagrams, Hybrid Images, and a course logo.
In Part B, I built and trained a diffusion model from scratch. First, I trained a UNet to denoise half-noisy MNIST images (original image + 50% pure noise). Then, to denoise images with different amounts of noise, I added Time Conditioning to the UNet, so that the UNet is told how noisy each image is. The trained UNet can accurately predict the noise that was added to the images. Using the trained UNet denoiser, I generated MNIST-like images by denoising pure noise over 300 steps, only to find that the generated images looked little like human-written digits. To improve the results, I added Class Conditioning to the UNet, so that the UNet is told not only how noisy the images are, but also their labels (0 to 9); 10% of the images are provided with no label. I applied Classifier-Free Guidance to generate MNIST-like images, and the results are much better than the previous attempt.
In Part A, I experimented with the DeepFloyd IF diffusion model. DeepFloyd IF is a text-to-image model trained by Stability AI. It takes text prompts as input and outputs images aligned with the text. Figure 1 shows three images I generated with DeepFloyd IF.
Diffusion models generate images by predicting the noise in an image and "denoising" it by subtracting that noise. The resulting "clean" image is often a different image, or a computer-generated one. To test this, I added different amounts of random noise to the Sather Tower sample image and tried to restore it using the noisy image and the model's noise prediction. Figure 2 shows the results.
Add 25% noise:
Add 50% noise:
Add 75% noise:
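For reference, here is a minimal sketch of the noising and one-step denoising math, assuming a DDPM-style schedule in which `alphas_cumprod` holds the cumulative products ᾱ_t and the model predicts the added noise ε. The names are placeholders, not DeepFloyd's API.

```python
import torch

def forward_noise(x0, t, alphas_cumprod):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return x_t, eps

def one_step_denoise(x_t, t, eps_hat, alphas_cumprod):
    """Estimate the clean image from x_t and the model's noise prediction eps_hat."""
    abar = alphas_cumprod[t]
    return (x_t - (1 - abar).sqrt() * eps_hat) / abar.sqrt()
```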
Denoising becomes harder as more noise is added. I used Iterative Denoising to produce better results. Figure 3 shows the restored images using Iterative Denoising, One-Step Denoising, and Gaussian Blur, respectively.
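The sketch below illustrates the idea with a simplified, deterministic (DDIM-style) update over a strided list of timesteps; the exact per-step formula used in the project also includes a variance term, which is omitted here.

```python
def iterative_denoise(model, x_t, timesteps, alphas_cumprod):
    """Denoise step by step along a strided timestep list, e.g. [690, 660, ..., 30, 0]."""
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps_hat = model(x_t, t)                                          # predicted noise at step t
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()   # current clean estimate
        # Move to the less-noisy timestep by re-noising the clean estimate with the same eps_hat.
        x_t = abar_next.sqrt() * x0_hat + (1 - abar_next).sqrt() * eps_hat
    return x_t
```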
When denoising a 100%-noisy image (pure noise) with the diffusion model, we get random "denoised" images. These images are unique and do not come from the natural world, so we can also call them "computer-generated" images. Figure 4 shows some of these randomly generated images.
Most of these images are not good. To improve their quality, I used a technique called Classifier-Free Guidance (CFG). Figure 5 shows the images generated with CFG.
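CFG combines a conditional and an unconditional noise estimate at each denoising step. A minimal sketch, assuming the model accepts some conditioning embedding (`cond` and `uncond` are placeholders):

```python
def cfg_noise_estimate(model, x_t, t, cond, uncond, scale=7.0):
    """eps = eps_uncond + scale * (eps_cond - eps_uncond); a scale > 1 pushes the
    estimate toward the conditional prediction and away from the unconditional one."""
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, uncond)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```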
By adding a certain amount of noise to an image and then denoising it with the diffusion model, we can produce similar images with different details. These details are generated by the model: the more noise added to the image, the more "computer creativity" appears in the restored image. Here are some examples.
The same can be done on a web image and two of my hand-drawn images (which are supposed to be a portrait of myself and the Apple logo; to be fair, I'm not a good artist, and sometimes I don't recognize my own drawings).
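Conceptually, this procedure just chains the two sketches above: noise the original image up to some starting timestep, then iteratively denoise from there. `start_t` is a placeholder for the chosen noise level.

```python
def image_to_image(model, x_orig, start_t, timesteps, alphas_cumprod):
    """Noise the original image to level start_t, then denoise it back.
    A larger start_t leaves more room for the model to invent new details."""
    x_t, _ = forward_noise(x_orig, start_t, alphas_cumprod)
    remaining = [t for t in timesteps if t <= start_t]   # keep only the steps below start_t
    return iterative_denoise(model, x_t, remaining, alphas_cumprod)
```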
I also produced inpaintings, where a certain part of the image is preserved and the rest is generated by the diffusion model. They are shown in the following.
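A sketch of the masking step used for inpainting, reusing the placeholder names above: after every denoising step, pixels outside the mask are reset to an appropriately re-noised copy of the original image, so only the masked region is actually generated.

```python
def inpaint_project(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep generated content where mask == 1; elsewhere force x_t back to the
    original image, noised to the current timestep t."""
    x_orig_t, _ = forward_noise(x_orig, t, alphas_cumprod)
    return mask * x_t + (1 - mask) * x_orig_t
```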
The following shows my attempts at Text-Conditioned Image-to-Image Translation. The images are generated by denoising a noisy image (e.g., a tower) in the direction of a text prompt (e.g., a rocket). Here are the results.
text: a rocket ship
text: a photo of a man
text: a lithograph of waterfalls
I generated Visual Anagrams, where the image shows one object right side up and a different object when flipped upside down. Here are the images.
Unflipped:
Flipped:
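The anagram is obtained by averaging two noise estimates inside the denoising loop: one for the image under the first prompt, and one for the vertically flipped image under the second prompt (flipped back before averaging). A sketch, with the model's call signature assumed:

```python
import torch

def anagram_noise_estimate(model, x_t, t, prompt_a, prompt_b):
    """Average the noise estimate for prompt_a with the un-flipped noise estimate
    computed on the flipped image for prompt_b."""
    eps_a = model(x_t, t, prompt_a)
    eps_b_flipped = model(torch.flip(x_t, dims=[-2]), t, prompt_b)   # flip along the height axis
    eps_b = torch.flip(eps_b_flipped, dims=[-2])
    return (eps_a + eps_b) / 2
```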
I also generated Hybrid Images, where an image appears as different objects when viewed up close and from afar. Here are the images.
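A hybrid is produced by blending the low frequencies of one prompt's noise estimate with the high frequencies of the other's. A sketch using a Gaussian low-pass filter; the kernel size and sigma here are illustrative, not the values used in the project.

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(model, x_t, t, prompt_far, prompt_near, ksize=33, sigma=2.0):
    """Low frequencies come from prompt_far (visible from a distance),
    high frequencies from prompt_near (visible up close)."""
    eps_far = model(x_t, t, prompt_far)
    eps_near = model(x_t, t, prompt_near)
    low = TF.gaussian_blur(eps_far, ksize, sigma)
    high = eps_near - TF.gaussian_blur(eps_near, ksize, sigma)
    return low + high
```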
Aside from the successful Waterfall-Skull hybrid, the other hybrid images seem to simply stack the elements of the two objects. The Snowy-Mountain-Village-Campfire hybrid just portrays people having a campfire on a snowy mountain.
Among all the course logos I generated with the diffusion model, the following two are my favorites, as they showcase both "computer" (robot) and "vision" (camera).
I implemented a UNet denoiser to denoise MNIST images at different noise levels. Noisy images are generated by adding pure noise, drawn from a standard normal distribution, to the original image. MNIST images at different noise levels are shown here, noised at 0%, 20%, 40%, 50%, 60%, 80%, and 100%, respectively.
The UNet denoiser is implemented with the following structure.
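As a rough illustration of the shape of such a denoiser (not the exact block layout or channel widths from the diagram), a minimal UNet with one downsampling stage and a skip connection might look like this:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal sketch of an unconditional UNet denoiser for 28x28 MNIST images."""
    def __init__(self, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, hidden, 3, padding=1), nn.GELU())
        self.down = nn.Sequential(nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.GELU())       # 28 -> 14
        self.mid = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU())
        self.up = nn.Sequential(nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.GELU())  # 14 -> 28
        self.dec = nn.Sequential(nn.Conv2d(2 * hidden, hidden, 3, padding=1), nn.GELU(),
                                 nn.Conv2d(hidden, 1, 3, padding=1))

    def forward(self, x):
        h = self.enc(x)                            # full-resolution features (kept for the skip)
        z = self.up(self.mid(self.down(h)))        # encode, process, decode back to 28x28
        return self.dec(torch.cat([z, h], dim=1))  # concatenate skip connection, map to 1 channel
```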
I trained the UNet on 50%-noisy MNIST images (original image + 50% pure noise), using the mean squared error (MSE) between the UNet's predicted image and the real clean image as the loss function. The UNet denoises the noisy images effectively. Here are the training loss curve and the denoised images.
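A sketch of this training setup; the dataset loading and optimizer settings here are placeholders rather than the exact values used in the project.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(model, sigma=0.5, epochs=5, device="cpu"):
    """Train a denoiser on MNIST corrupted with sigma * N(0, I) noise,
    using MSE against the clean image as the loss."""
    data = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=256, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.to(device).train()
    for _ in range(epochs):
        for x, _ in loader:
            x = x.to(device)
            x_noisy = x + sigma * torch.randn_like(x)      # z = x + sigma * eps
            loss = F.mse_loss(model(x_noisy), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
```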
Even though the UNet was trained to denoise 50%-noisy images, it denoises images at other noise levels reasonably well. Here are the noisy images (first row, noised at 0%, 20%, 40%, 50%, 60%, 80%, 100%, respectively) and the images denoised by the UNet (second row).
To improve the UNet's ability to denoise images at different noise levels, I conditioned it on the timestep t, which tells the UNet how noisy the image is. Instead of the clean image, the UNet now predicts the noise that was added to the image. The following shows the training algorithm.
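A sketch of one training step under that algorithm, assuming `alphas_cumprod` holds the ᾱ_t schedule and the UNet takes the normalized timestep as a second argument:

```python
import torch
import torch.nn.functional as F

def train_step_time_conditioned(model, x0, alphas_cumprod, opt, num_ts=300):
    """Pick a random timestep per image, noise the batch to that level,
    and regress the added noise with MSE."""
    t = torch.randint(0, num_ts, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    loss = F.mse_loss(model(x_t, t / num_ts), eps)          # t normalized to [0, 1)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```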
The UNet was trained for 20 epochs using the Adam optimizer with an initial learning rate of 1e-3 for gradient descent. Each hidden layer in the UNet had 128 hidden units, and each training step used a batch of 128 noisy MNIST images. Here is the training loss curve.
Ideally, when fed a pure-noise image to denoise, the UNet generates a "human-written" digit image of its own. This process is called sampling, and it is defined as follows.
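A sketch of the sampling loop, assuming `betas`, `alphas`, and `alphas_cumprod` are the usual DDPM schedule tensors and the UNet takes the normalized timestep:

```python
import torch

@torch.no_grad()
def sample(model, betas, alphas, alphas_cumprod, num_ts=300, shape=(16, 1, 28, 28), device="cpu"):
    """Start from pure noise and step t = T-1, ..., 0, forming the DDPM posterior mean
    from the clean-image estimate and adding fresh noise at every step but the last."""
    x = torch.randn(shape, device=device)
    for t in reversed(range(num_ts)):
        tt = torch.full((shape[0],), t, device=device)
        eps_hat = model(x, tt / num_ts)
        abar, a, b = alphas_cumprod[t], alphas[t], betas[t]
        abar_prev = alphas_cumprod[t - 1] if t > 0 else alphas_cumprod.new_tensor(1.0)
        x0_hat = (x - (1 - abar).sqrt() * eps_hat) / abar.sqrt()     # current clean estimate
        x = (abar_prev.sqrt() * b / (1 - abar)) * x0_hat \
            + (a.sqrt() * (1 - abar_prev) / (1 - abar)) * x          # posterior mean of x_{t-1}
        if t > 0:
            x = x + b.sqrt() * torch.randn_like(x)                   # no noise injected at t = 0
    return x
```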
I post-processed the sampled images by setting all pixels below a certain threshold to black, further removing minor noise. The computer-generated MNIST images after 5 and 20 training epochs are shown here.
It's clear that the generated images are not satisfying, as many of them are hardly readable.
The unsatisfying results from the Time-Conditioned UNet were probably due to the lack of image labels: the model didn't know which digit it should generate, so it generated a blend of all digits. To fix this, I added Class Conditioning to the UNet. During training, the UNet is told which digit each noisy image represents, and during sampling, we tell the UNet which digit to generate. Classifier-Free Guidance is applied in the sampling process. Algorithms 3 and 4 show the training and sampling processes for the Class-Conditioned UNet, respectively.
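A sketch of one class-conditioned training step, assuming the UNet takes a one-hot class vector as a third argument and that dropped labels are represented by an all-zero vector:

```python
import torch
import torch.nn.functional as F

def train_step_class_conditioned(model, x0, labels, alphas_cumprod, opt,
                                 num_ts=300, p_uncond=0.1, num_classes=10):
    """Noise a labeled batch to random timesteps and regress the added noise,
    zeroing out the class vector for ~10% of the images so the model also
    learns the unconditional case needed for CFG."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(x0.shape[0], device=x0.device) >= p_uncond).float().unsqueeze(1)
    c = c * keep                                              # dropped labels become all zeros
    t = torch.randint(0, num_ts, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    loss = F.mse_loss(model(x_t, t / num_ts, c), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

At sampling time, the only change from the time-conditioned sampler above is the noise estimate: the model is run once with the desired digit's one-hot vector and once with the all-zero vector, and the two predictions are combined as eps = eps_uncond + γ(eps_cond − eps_uncond) with a guidance scale γ > 1.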
The hyperparameters used for Class-Conditioned training are identical to those used for the Time-Conditioned UNet. Here is the training loss curve.
Here are the results after 5 and 20 training epochs, respectively.
Given enough training, the results improved dramatically compared to the Time-Conditioning attempt.
* Part A finished on Nov 6, 2024; Part B finished on Nov 16, 2024.