CS180 Final Project: Neural Radiance Field (NeRF)

David Wei
david_wei@berkeley.edu

1. Introduction

My final project consists of two parts: Fit a Neural Field to a 2D Image (Part A), and Fit a Neural Radiance Field from Multi-view Images (Part B).

In Part A, I built a Multilayer Perceptron (MLP) network to fit a single 2D image so that, given any pixel's coordinates, the network can predict that pixel's RGB color. When the image's shape is provided, the network can reconstruct the whole image.

In Part B, I trained a larger MLP network to serve as a Neural Radiance Field (NeRF) and used it to fit a 3D Lego object through inverse rendering from multi-view calibrated images. Each pixel in the images was associated with a ray represented in the 3D world coordinate system. Sample locations were gathered along the rays, and their volume-rendered colors were fit to the RGB colors of the corresponding pixels. In this way, the Lego object was encoded into the NeRF. Using the trained NeRF, I am able to predict images of the Lego taken from any given perspective. I rendered such images into a video to create a rotating effect of the Lego.

Part A: Fit a Neural Field to a 2D Image

In Part A, I trained a Neural Field to fit a single 2D image. This means that given any pixel's coordinates, the network can predict that pixel's RGB color. When the image's shape is provided, the network can reconstruct the whole image.

In order to map 2D pixel coordinates to RGB colors, I built a Multilayer Perceptron (MLP) network defined as follows.

Figure 1
Figure 1: Architecture of MLP network to fit a 2D image.

Sinusoidal Positional Encoding (PE) is applied to expand the dimensionality of the input coordinates. Here the PE operation is defined as \[ PE(x) = \{x, \sin(2^0\pi x), \cos(2^0\pi x), \sin(2^1\pi x), \cos(2^1\pi x), ..., \sin(2^{L-1}\pi x), \cos(2^{L-1}\pi x)\}, \] where \(L\) is the highest frequency level.
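A minimal sketch of this PE operation in PyTorch (the function name and tensor shapes are my own choices, not part of the project starter code):

```python
import math
import torch

def positional_encoding(x, L=10):
    """Sinusoidal positional encoding as defined above.

    x: tensor of shape (N, D) with entries roughly in [0, 1].
    Returns a tensor of shape (N, D * (2L + 1)): the input itself followed by
    sin/cos pairs at frequencies 2^0 * pi, ..., 2^(L-1) * pi.
    """
    encoded = [x]
    for i in range(L):
        freq = (2.0 ** i) * math.pi
        encoded.append(torch.sin(freq * x))
        encoded.append(torch.cos(freq * x))
    return torch.cat(encoded, dim=-1)
```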

It is not feasible to train the network with all the pixels in each iteration due to the GPU memory limit, so I implemented a dataloader that randomly samples a batch of N pixels at every iteration of training. I normalized the pixel coordinates into [0, 1] before feeding them to the PE.
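A sketch of this sampling step, assuming the image is stored as an (H, W, 3) float tensor with values in [0, 1] (the function name is hypothetical):

```python
import torch

def sample_pixel_batch(image, batch_size=10_000):
    """Randomly sample pixel coordinates (normalized to [0, 1]) and their colors.

    image: (H, W, 3) float tensor in [0, 1].
    Returns coords of shape (batch_size, 2) and colors of shape (batch_size, 3).
    """
    H, W, _ = image.shape
    ys = torch.randint(0, H, (batch_size,))
    xs = torch.randint(0, W, (batch_size,))
    coords = torch.stack([xs / (W - 1), ys / (H - 1)], dim=-1).float()
    colors = image[ys, xs]
    return coords, colors
```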

I trained my model on an animal image and the famous Lenna image shown in Figure 2.

Figure 2a
Figure 2a: The animal image.
Figure 2b
Figure 2b: The Lenna image.

The model was trained for 2,000 steps on both images, with each step learning from a batch of N = 10,000 pixels. I used the Adam optimizer with an initial learning rate of 1e-2 for gradient descent. I set the hidden layer size in the MLP net to 256 neurons, and the highest frequency level in the PE to L = 10, which maps a 2-dimensional coordinate to a 42-dimensional vector. I chose the Mean Squared Error (MSE) between the predicted image and the real image as the loss function, and used the Peak Signal-to-Noise Ratio (PSNR) to measure the reconstruction quality of the image. PSNR is defined as \[ PSNR = 10 \cdot \log_{10}\left(\frac{1}{MSE}\right). \] The MSE/PSNR graphs for both trainings are shown in Figure 3.

Figure 3a
Figure 3a: The MSE/PSNR graph for the training on the animal image.
Figure 3b
Figure 3b: The MSE/PSNR graph for the training on the Lenna image.
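For reference, a sketch of one training step under these settings; `model` stands in for the MLP of Figure 1, and `positional_encoding` / `sample_pixel_batch` refer to the sketches above:

```python
import torch
import torch.nn.functional as F

# Hypothetical training loop: `model` maps 42-dim PE vectors to RGB and
# `image` is the target image as an (H, W, 3) float tensor in [0, 1].
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(2_000):
    coords, colors = sample_pixel_batch(image, batch_size=10_000)
    pred = model(positional_encoding(coords, L=10))
    loss = F.mse_loss(pred, colors)                   # MSE against the true pixel colors
    psnr = 10.0 * torch.log10(1.0 / loss.detach())    # PSNR as defined above, for logging
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```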

The predicted images after different numbers of training steps are shown in Figure 4.

Figure 4a
Figure 4a: Predicted animal image after 0 training steps.
Figure 4b
Figure 4b: Predicted animal image after 100 training steps.
Figure 4c
Figure 4c: Predicted animal image after 250 training steps.
Figure 4d
Figure 4d: Predicted animal image after 500 training steps.
Figure 4e
Figure 4e: Predicted animal image after 1000 training steps.
Figure 4f
Figure 4f: Predicted animal image after 1950 training steps.
Figure 4g
Figure 4g: Predicted Lenna image after 0 training steps.
Figure 4h
Figure 4h: Predicted Lenna image after 100 training steps.
Figure 4i
Figure 4i: Predicted Lenna image after 250 training steps.
Figure 4j
Figure 4j: Predicted Lenna image after 500 training steps.
Figure 4k
Figure 4k: Predicted Lenna image after 1000 training steps.
Figure 4l
Figure 4l: Predicted Lenna image after 1950 training steps.

I reduced the highest frequency level in the PE to L = 5 and the hidden layer size to 128 neurons, and trained the MLP network on the animal image again. Figure 5 shows the comparison between the reduced network and the original network.

Figure 5a
Figure 5a: Predicted animal image on reduced MLP net.
Figure 5b
Figure 5b: Predicted animal image on original MLP net.

The predicted image generated by the reduced MLP net is hardly accurate. This indicates that a sufficiently large network is necessary to yield good results.

Part B: Fit a Neural Radiance Field from Multi-view Images

In Part B, I was given 100 images of the same Lego object taken from different viewpoints, along with the camera's position and orientation for each image. I modeled the 3D Lego object using a Neural Radiance Field, and used it to predict images of the Lego object from new perspectives.

B1. Transform 2D Pixel Coordinates to Rays in 3D World Coordinates

To build a model that represents a 3D object from multi-view calibrated images, I need to transform a 2D pixel coordinate in the camera space to a ray in 3D space represented in world coordinates. The real-world points that contribute to the RGB color of that pixel should be located along that ray.

First, I used the intrinsic matrix \(\mathbf{K}\) to transform 2D pixel coordinates to 3D coordinates in camera space. The intrinsic matrix is defined as \[ \begin{align} \mathbf{K} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \end{align}, \] where \((f_x, f_y)\) is the camera's focal length, and \((o_x, o_y)\) is the camera's principal point, defined as \[ o_x = \text{image_width} / 2; \] \[ o_y = \text{image_height} / 2. \] The projection from a 2D location \((u, v)\) in pixel coordinates to a 3D point \((x_c, y_c, z_c)\) in the camera coordinate system is denoted as \[ \begin{align} s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \end{align}, \] in which \(s=z_c\) is the depth of this point along the optical axis.
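A sketch of this inverse projection (the function name and batched shapes are my own):

```python
import torch

def pixel_to_camera(K, uv, s):
    """Invert the intrinsic projection: pixels (u, v) at depth s -> camera coordinates.

    K: (3, 3) intrinsic matrix; uv: (N, 2) pixel coordinates; s: (N,) depths.
    Returns (N, 3) points in the camera coordinate system.
    """
    ones = torch.ones(uv.shape[0], 1)
    uv_h = torch.cat([uv.float(), ones], dim=-1)            # homogeneous pixel coordinates
    x_c = (torch.linalg.inv(K) @ uv_h.T).T * s[:, None]     # K^{-1} [u, v, 1]^T, scaled by depth
    return x_c
```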

Then, I used the rotation matrix \(\mathbf{R}_{3\times3}\) and a translation vector \(\mathbf{t}\) to transform 3D camera space coordinate \(\mathbf{X_c} = (x_c, y_c, z_c)\) to 3D world space coordinate \(\mathbf{X_w} = (x_w, y_w, z_w)\), denoted as \[ \begin{align} \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{3\times3} & \mathbf{t} \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \end{align}. \]

A ray can be defined by an origin vector \(\mathbf{r}_o \in \mathbb{R}^3\) and a direction vector \(\mathbf{r}_d \in \mathbb{R}^3\). I want to know \(\{\mathbf{r}_o, \mathbf{r}_d\}\) for every pixel \((u, v)\). The origin \(\mathbf{r}_o\), i.e. the location of the camera, is \[ \begin{align} \mathbf{r}_o = -\mathbf{R}_{3\times3}^{-1}\mathbf{t} \end{align}. \] To calculate the ray direction for pixel \((u, v)\), I chose a point along this ray with depth 1 \((s=1)\) and found its coordinates in the world space \(\mathbf{X_w} = (x_w, y_w, z_w)\). Then the normalized ray direction can be computed by \[ \begin{align} \mathbf{r}_d = \frac{\mathbf{X_w} - \mathbf{r}_o}{||\mathbf{X_w} - \mathbf{r}_o||_2} \end{align}. \] In this way, I computed the corresponding ray \(\{\mathbf{r}_o, \mathbf{r}_d\}\) for every pixel \((u, v)\) of every camera.
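Putting the two transforms together, a sketch of ray generation that reuses `pixel_to_camera` from above; it assumes the extrinsics are given as a (4, 4) world-to-camera matrix \(\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix}\):

```python
import torch

def pixel_to_ray(K, w2c, uv):
    """Compute the ray {r_o, r_d} for each pixel (u, v).

    K: (3, 3) intrinsics; w2c: (4, 4) world-to-camera matrix [R t; 0 1];
    uv: (N, 2) pixel coordinates. Returns r_o and r_d, each of shape (N, 3).
    """
    R, t = w2c[:3, :3], w2c[:3, 3]
    r_o = -R.T @ t                                         # camera center; R^{-1} = R^T for a rotation
    c2w = torch.linalg.inv(w2c)
    x_c = pixel_to_camera(K, uv, torch.ones(uv.shape[0]))  # points at depth s = 1
    x_w = (c2w[:3, :3] @ x_c.T).T + c2w[:3, 3]             # camera -> world coordinates
    r_d = x_w - r_o
    r_d = r_d / torch.linalg.norm(r_d, dim=-1, keepdim=True)
    return r_o.expand_as(r_d), r_d
```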

B2. Sampling along the Ray

Similar to Part A, I randomly sampled a batch of N rays from different cameras and discretized each ray into a number of 3D points. The points are sampled uniformly along each ray between a "near point" and a "far point". During training, I added a small perturbation to the points' coordinates. Figure 6 shows the sampled rays (black lines) and points (red dots) during each step of training. Their coordinates are fed to the MLP net as inputs. I reduced the batch size (number of rays sampled) to N = 10 to produce a less crowded figure.

Figure 6
Figure 6: Sampled rays and points during each step of training.
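A sketch of the sampling described above; the near/far bounds and the jitter scheme here are illustrative choices, not necessarily the exact values I used:

```python
import torch

def sample_points_along_rays(r_o, r_d, near=2.0, far=6.0, n_samples=32, perturb=True):
    """Uniformly sample 3D points along each ray between `near` and `far`.

    r_o, r_d: (N, 3) ray origins and directions.
    Returns points of shape (N, n_samples, 3) and their depths t of shape (N, n_samples).
    """
    t = torch.linspace(near, far, n_samples).expand(r_o.shape[0], n_samples).clone()
    if perturb:
        step = (far - near) / (n_samples - 1)
        t = t + (torch.rand_like(t) - 0.5) * step          # small jitter during training
    points = r_o[:, None, :] + t[..., None] * r_d[:, None, :]
    return points, t
```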

B3. Volume Rendering

The pixel's color captured by the camera is the combined effect of all objects along the pixel's corresponding ray. The color can be represented by the volume rendering equation \[ \begin{align} C(\mathbf{r})=\int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t), \mathbf{d}) d t, \text { where } T(t)=\exp \left(-\int_{t_n}^t \sigma(\mathbf{r}(s)) d s\right)\end{align}. \] The discrete approximation of this equation can be stated as \[ \begin{align} \hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \text { where } T_i=\exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \end{align}, \] where \(\mathbf{c}_i\) is the RGB color obtained at sample location \(i\), \(T_i\) is the probability of the ray not terminating before sample location \(i\), and \(1 - e^{-\sigma_i \delta_i}\) is the probability of the ray terminating at sample location \(i\), in which \(\sigma_i\) is the density at location \(i\), and \(\delta_i\) is the step size, i.e. the distance between adjacent sample locations.
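A sketch of this discrete approximation (tensor names and shapes are my own):

```python
import torch

def volume_render(sigmas, colors, deltas):
    """Discrete volume rendering, following the approximation above.

    sigmas: (N, S) densities; colors: (N, S, 3) RGB values; deltas: (N, S) step sizes.
    Returns (N, 3) rendered pixel colors.
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)             # probability of terminating at sample i
    accum = torch.cumsum(sigmas * deltas, dim=-1)
    trans = torch.exp(-(accum - sigmas * deltas))          # T_i: exclusive cumulative sum
    weights = trans * alphas
    return (weights[..., None] * colors).sum(dim=-2)
```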

I want to train a Neural Radiance Field such that the color rendered along a ray matches the color the camera captured. The smaller the difference between the rendered color and the corresponding true pixel color, the better the model fits the 3D object.

B4. Training a NeRF Net

I built a larger MLP network to fit a Neural Radiance Field to a 3D object. It takes the 3D world coordinates and the normalized ray direction as inputs, and predicts the RGB color and density at that sample location. Figure 7 shows the architecture of the MLP net.

Figure 7
Figure 7: Architecture of MLP network to fit a 3D object.
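The sketch below outlines a NeRF-style MLP of this form; the exact depths, widths, and skip connections are specified in Figure 7, so the layer layout here is only an illustrative placeholder:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch of a NeRF-style MLP: PE(x) -> hidden layers -> density + RGB.

    The layer counts are illustrative; the real architecture is shown in Figure 7.
    """

    def __init__(self, hidden=256, L=10):
        super().__init__()
        in_x = 3 * (2 * L + 1)      # PE of the 3D sample location
        in_d = 3 * (2 * L + 1)      # PE of the normalized ray direction
        self.trunk = nn.Sequential(
            nn.Linear(in_x, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())   # density >= 0
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + in_d, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),                       # RGB in [0, 1]
        )

    def forward(self, x_pe, d_pe):
        h = self.trunk(x_pe)
        rgb = self.rgb_head(torch.cat([h, d_pe], dim=-1))
        return rgb, self.sigma_head(h)
```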

I trained two MLP models, whose hidden layers contain 256 and 1,024 neurons, respectively. The smaller MLP net was trained for 2,000 steps while the larger one was trained for 4,000 steps. At each training step, the smaller model randomly samples 10,000 rays with 32 sample locations on each ray, while the larger net randomly samples 5,000 rays with 64 sample locations on each ray. The highest frequency level for the PE was set to L = 10 in the smaller net and L = 20 in the larger net. For both models, I used the Adam optimizer with an initial learning rate of 5e-4 for gradient descent. Similar to Part A, I chose the MSE between the rendered color and the true pixel color as the loss function, and used PSNR to measure the reconstruction quality of the 3D object. The MSE/PSNR graphs for the two models are as follows.

Figure 8a
Figure 8a: The MSE/PSNR graph for the MLP net with 256 neurons in each hidden layer.
Figure 8b
Figure 8b: The MSE/PSNR graph for the MLP net with 1,024 neurons in each hidden layer.

The PSNR of the first approach reached 23, while the second reached 27.

Figure 9 shows the nets' predicted images of the Lego object viewed from behind, after different numbers of training steps.

Figure 9a
Figure 9a: Predicted Lego image after 100 training steps using the smaller MLP net.
Figure 9b
Figure 9b: Predicted Lego image after 250 training steps using the smaller MLP net.
Figure 9c
Figure 9c: Predicted Lego image after 500 training steps using the smaller MLP net.
Figure 9d
Figure 9d: Predicted Lego image after 1000 training steps using the smaller MLP net.
Figure 9e
Figure 9e: Predicted Lego image after 1500 training steps using the smaller MLP net.
Figure 9f
Figure 9f: Predicted Lego image after 1950 training steps using the smaller MLP net.
Figure 9g
Figure 9g: Predicted Lego image after 100 training steps using the larger MLP net.
Figure 9h
Figure 9h: Predicted Lego image after 250 training steps using the larger MLP net.
Figure 9i
Figure 9i: Predicted Lego image after 500 training steps using the larger MLP net.
Figure 9j
Figure 9j: Predicted Lego image after 1000 training steps using the larger MLP net.
Figure 9k
Figure 9k: Predicted Lego image after 2000 training steps using the larger MLP net.
Figure 9l
Figure 9l: Predicted Lego image after 3950 training steps using the larger MLP net.

I used the trained NeRF models to predict images from 60 perspectives, and rendered them into a video showing the Lego rotating. The videos are shown in Figure 10.

Figure 10a
Figure 10a: Rendered video of the Lego rotating generated by the smaller MLP net.
Figure 10b
Figure 10b: Rendered video of the Lego rotating generated by the larger MLP net.

It can be seen that the larger MLP net produces clearer pictures, although it took much longer to train.

B5. Change Background Color (Bells & Whistles)

To change the background color of the predicted images, I modified the volume rendering equation to \[ \begin{align} \hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i + T_{N+1}\mathbf{c}_{back}, \text { where } T_i=\exp \left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \end{align}. \] In this way, when the densities of all sample locations along a ray are low, the ray effectively reaches the background, and the rendered color approaches the background color \(\mathbf{c}_{back}\). Figure 11 shows the rendered rotation video with a red, green, and blue background, respectively.

Figure 11a
Figure 11a: Rendered video of the Lego rotating with a red background.
Figure 11b
Figure 11b: Rendered video of the Lego rotating with a green background.
Figure 11c
Figure 11c: Rendered video of the Lego rotating with a blue background.
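A sketch of this modified rendering, reusing the tensor names from the volume rendering sketch in B3; the leftover transmittance after the last sample is assigned to the background color:

```python
import torch

def volume_render_with_background(sigmas, colors, deltas, c_back):
    """Volume rendering with a solid background color, per the modified equation above.

    c_back: (3,) background RGB. Rays with little accumulated density keep most of
    their transmittance, so they render (mostly) as the background color.
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    accum = torch.cumsum(sigmas * deltas, dim=-1)
    trans = torch.exp(-(accum - sigmas * deltas))          # T_i
    weights = trans * alphas
    rendered = (weights[..., None] * colors).sum(dim=-2)
    t_final = torch.exp(-accum[:, -1])                     # T_{N+1}: the ray reached the background
    return rendered + t_final[:, None] * c_back
```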

* Finished on Dec 9, 2024.