My final project consists of two parts: Fit a Neural Field to a 2D Image (Part A), and Fit a Neural Radiance Field from Multi-view Images (Part B).
In Part A, I built a Multilayer Perceptron (MLP) network to fit a single 2D image so that, given any pixel's coordinates, the network can predict that pixel's RGB color. When the image's shape is provided, the network can reconstruct the whole image.
In Part B, I trained a larger MLP network to serve as a Neural Radiance Field (NeRF) and used it to fit a 3D Lego object through inverse rendering from multi-view calibrated images. Each pixel in the images was associated with a ray represented in the 3D world coordinate system. Sample locations were gathered along the rays, and their volume rendering results were fit to the RGB colors of the images' pixels. In this way the Lego object was modeled by the NeRF. Using the trained NeRF, I am able to predict images of the Lego taken from any given perspective. I rendered these images into a video to create a rotating effect of the Lego.
In Part A, I trained a Neural Field to fit a single 2D image. This means that given any pixel's coordinates, the network can predict the pixel's RGB color. When the image's shape is provided, the network can reconstruct the whole image.
In order to map 2D pixel coordinates to RGB colors, I built a Multilayer Perceptron (MLP) network defined as follows.
Sinusoidal Positional Encoding (PE) is applied to expand the dimensionality of the input coordinates. Here the PE operation is defined as \[ PE(x) = \{x, \sin(2^0\pi x), \cos(2^0\pi x), \sin(2^1\pi x), \cos(2^1\pi x), \ldots, \sin(2^{L-1}\pi x), \cos(2^{L-1}\pi x)\}, \] where \(L\) is the highest frequency level.
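As a concrete illustration, here is a minimal sketch of this encoding in PyTorch; the function name and tensor layout are my own, not part of the project code:

```python
import math
import torch

def positional_encoding(x, L):
    """Sinusoidal positional encoding.
    x: tensor of shape (..., d) with values roughly in [0, 1].
    Returns shape (..., d * (2L + 1)): x itself followed by sin/cos
    pairs at frequencies 2^0, ..., 2^(L-1)."""
    out = [x]
    for i in range(L):
        out.append(torch.sin((2.0 ** i) * math.pi * x))
        out.append(torch.cos((2.0 ** i) * math.pi * x))
    return torch.cat(out, dim=-1)
```

With d = 2 and L = 10, this yields 2 × (2 · 10 + 1) = 42 features, matching the input dimensionality used below.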
It is not feasible to train the network with all the pixels in each iteration due to the GPU memory limit, so I implemented a dataloader that randomly samples a batch of N pixels at every training iteration. I normalized the pixel coordinates to [0, 1] before feeding them to the PE.
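A sketch of how such a sampler might look, assuming the image is stored as a float tensor with values in [0, 1] (the helper name and layout are hypothetical):

```python
import torch

def sample_pixel_batch(image, batch_size):
    """Randomly sample `batch_size` pixels from an (H, W, 3) image tensor.
    Returns normalized (x, y) coordinates in [0, 1] and their RGB colors."""
    H, W, _ = image.shape
    idx = torch.randint(0, H * W, (batch_size,))
    ys, xs = idx // W, idx % W
    coords = torch.stack([xs / (W - 1), ys / (H - 1)], dim=-1).float()
    colors = image[ys, xs]  # (batch_size, 3)
    return coords, colors
```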
I trained my model on an animal image and the famous Lenna image shown in Figure 2.
The model is trained for 2,000 steps on both images, with each step learning from a batch of N = 10,000 pixels. I used the Adam optimizer with an initial learning rate of 1e-2 for gradient descent. I set the hidden layer size in the MLP net to 256 neurons, and the highest frequency level in the PE to L = 10, which maps a 2-dimensional coordinate into a 42-dimensional vector. I chose the Mean Squared Error (MSE) between the predicted image and the real image as the loss function, and used the Peak Signal-to-Noise Ratio (PSNR) to measure the quality of the reconstructed image. For pixel values normalized to [0, 1], PSNR is defined as \[ PSNR = 10 \cdot \log_{10}\left(\frac{1}{MSE}\right). \] The MSE/PSNR graphs for both trainings are shown in Figure 3.
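For reference, a minimal PSNR helper consistent with this definition (assuming predictions and targets are in [0, 1]):

```python
import torch
import torch.nn.functional as F

def psnr(pred, target):
    """PSNR in dB for images with values in [0, 1]."""
    mse = F.mse_loss(pred, target)
    return 10.0 * torch.log10(1.0 / mse)
```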
The predicted images after different numbers of training steps are shown in Figure 4.
I reduced the highest frequency level in the PE to L = 5 and the hidden layer size to 128 neurons, and trained the MLP network on the animal image again. Figure 5 compares the results of the reduced network and the original network.
The predicted image generated by the reduced MLP net is hardly accurate. This indicates that a sufficiently large network is necessary to yield good results.
In Part B, I was given 100 images of the same Lego object taken from different viewpoints, along with the camera's position and orientation for each image. I modeled the 3D Lego object using a Neural Radiance Field, and used it to predict images of the Lego object from new perspectives.
To build a model that represents a 3D object from multi-view calibrated images, I need to transform a 2D pixel coordinate in camera space into a ray in 3D space represented in world coordinates. The real-world points that contribute to the RGB color of that pixel should be located along that ray.
First, I used the intrinsic matrix \(\mathbf{K}\) to transform 2D pixel coordinates to 3D coordinates in camera space. The intrinsic matrix is defined as \[ \begin{align} \mathbf{K} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \end{align}, \] where \((f_x, f_y)\) is the camera's focal length, and \((o_x, o_y)\) is the camera's principal point, defined as \[ o_x = \text{image\_width} / 2; \] \[ o_y = \text{image\_height} / 2. \] The projection from a 2D location \((u, v)\) in pixel coordinates to a 3D point \((x_c, y_c, z_c)\) in the camera coordinate system is denoted as \[ \begin{align} s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \end{align}, \] in which \(s = z_c\) is the depth of this point along the optical axis.
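A small sketch of this back-projection, assuming pixels and depths are given as tensors (names are my own):

```python
import torch

def pixel_to_camera(K, uv, s):
    """Back-project pixels (u, v) at depth s into 3D camera-space points.
    K: (3, 3) intrinsics; uv: (N, 2) pixel coordinates; s: (N,) depths."""
    uv = uv.float()
    uv_h = torch.cat([uv, torch.ones(uv.shape[0], 1)], dim=-1)  # homogeneous pixels
    return (torch.linalg.inv(K) @ uv_h.T).T * s[:, None]        # (N, 3) camera coordinates
```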
Then, I used the rotation matrix \(\mathbf{R}_{3\times3}\) and the translation vector \(\mathbf{t}\) to relate 3D camera space coordinates \(\mathbf{X_c} = (x_c, y_c, z_c)\) and 3D world space coordinates \(\mathbf{X_w} = (x_w, y_w, z_w)\), denoted as \[ \begin{align} \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{3\times3} & \mathbf{t} \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \end{align}. \] This matrix maps world space to camera space; inverting it transforms camera space coordinates to world space coordinates.
A ray can be defined by an origin vector \(\mathbf{r}_o \in \mathbb{R}^3\) and a direction vector \(\mathbf{r}_d \in \mathbb{R}^3\). I want to know \(\{\mathbf{r}_o, \mathbf{r}_d\}\) for every pixel \((u, v)\). The origin \(\mathbf{r}_o\), i.e., the location of the camera, is \[ \begin{align} \mathbf{r}_o = -\mathbf{R}_{3\times3}^{-1}\mathbf{t} \end{align}. \] To calculate the ray direction for pixel \((u, v)\), I chose a point along this ray at depth 1 \((s=1)\) and found its coordinates in world space \(\mathbf{X_w} = (x_w, y_w, z_w)\). The normalized ray direction can then be computed as \[ \begin{align} \mathbf{r}_d = \frac{\mathbf{X_w} - \mathbf{r}_o}{\|\mathbf{X_w} - \mathbf{r}_o\|_2} \end{align}. \] In this way, I computed the corresponding ray \(\{\mathbf{r}_o, \mathbf{r}_d\}\) for every pixel \((u, v)\) in each camera.
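Putting the intrinsics and extrinsics together, a hedged sketch of per-pixel ray computation (it reuses the `pixel_to_camera` helper above; `w2c` is assumed to be the 4×4 world-to-camera matrix from the previous equation):

```python
import torch

def pixel_to_ray(K, w2c, uv):
    """Compute the ray origin and normalized direction for pixels (u, v)."""
    c2w = torch.linalg.inv(w2c)                              # camera-to-world transform
    r_o = c2w[:3, 3]                                         # camera center, equals -R^{-1} t
    x_c = pixel_to_camera(K, uv, torch.ones(uv.shape[0]))    # points at depth s = 1
    x_w = (c2w[:3, :3] @ x_c.T).T + c2w[:3, 3]               # camera space -> world space
    r_d = x_w - r_o
    r_d = r_d / r_d.norm(dim=-1, keepdim=True)               # normalize direction
    return r_o.expand_as(r_d), r_d
```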
Similar to Part A, I randomly sampled a batch of N rays from different cameras. I discretized each ray into a number of 3D points, sampled uniformly along the ray between a "near point" and a "far point". During training, I added a small perturbation to the points' coordinates. Figure 6 shows the sampled rays (black lines) and points (red dots) during a training step. Their coordinates are fed to the MLP net as inputs. I reduced the batch size (the number of rays sampled) to N = 10 to produce a less crowded figure.
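One way this sampling might be implemented; the `near` and `far` distances are placeholders for whatever bounds are used, and the jittering scheme is my own simplification:

```python
import torch

def sample_along_rays(r_o, r_d, near, far, n_samples, perturb=True):
    """Uniformly sample 3D points along each ray between `near` and `far`,
    optionally jittered by a small random offset during training.
    r_o, r_d: (N, 3) ray origins and directions."""
    t = torch.linspace(near, far, n_samples).expand(r_o.shape[0], n_samples).clone()
    if perturb:
        t = t + torch.rand_like(t) * (far - near) / n_samples   # small perturbation
    points = r_o[:, None, :] + t[..., None] * r_d[:, None, :]   # (N, n_samples, 3)
    return points, t
```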
The pixel's color captured by the camera is the combined effect of all objects along the pixel's corresponding ray. The color can be represented by the volume rendering equation \[ \begin{align} C(\mathbf{r})=\int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t))\mathbf{c}(\mathbf{r}(t), \mathbf{d}) d t, \text { where } T(t)=\exp \left(-\int_{t_n}^t \sigma(\mathbf{r}(s)) d s\right)\end{align}. \] The discrete approximation of this equation can be stated as \[ \begin{align} \hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \text { where } T_i=\exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \end{align}, \] where \(\mathbf{c}_i\) is the RGB color obtained at sample location \(i\), \(T_i\) is the probability that the ray does not terminate before sample location \(i\), and \(1 - e^{-\sigma_i \delta_i}\) is the probability that the ray terminates at sample location \(i\), in which \(\sigma_i\) is the volume density at location \(i\) and \(\delta_i\) is the step size, i.e., the distance between adjacent sample locations.
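A direct translation of this discrete sum, assuming a constant step size \(\delta\) and per-sample densities and colors already predicted by the network (function and variable names are my own):

```python
import torch

def volume_render(sigmas, rgbs, delta):
    """Discrete volume rendering along each ray.
    sigmas: (N, S) densities; rgbs: (N, S, 3) colors; delta: step size."""
    alphas = 1.0 - torch.exp(-sigmas * delta)                    # P(terminate at sample i)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)          # prod_{j<=i} (1 - alpha_j)
    T = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)  # T_i uses j < i
    weights = T * alphas
    return (weights[..., None] * rgbs).sum(dim=-2)               # (N, 3) rendered colors
```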
I wish to train a Neural Radiance Field where the rendered color along a ray would match the color the camera captured. The smaller the difference between the rendered color and its corresponding true pixel color, the better the model fits the 3D object.
I built a larger MLP network to fit a Neural Radiance Field to a 3D object. It takes the 3D world coordinates and the normalized ray direction as inputs, and predicts the RGB color of that sample location and its density. Figure 7 shows the architecture of the MLP net.
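Figure 7 is the authoritative definition of the network; purely for illustration, a rough PyTorch sketch of an MLP with this input/output structure might look like the following (the layer counts and activations here are my assumptions, loosely following the original NeRF design):

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Sketch: encoded position -> density + feature; feature + encoded direction -> RGB."""
    def __init__(self, pos_dim, dir_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))                  # density is non-negative
        rgb = self.rgb_head(torch.cat([h, d_enc], dim=-1))
        return rgb, sigma
```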
I trained two MLP models whose hidden layers contain 256 and 1,024 neurons, respectively. The smaller MLP net was trained for 2,000 steps, while the larger one was trained for 4,000 steps. At each training step, the smaller model randomly samples 10,000 rays with 32 sample locations per ray, while the larger net randomly samples 5,000 rays with 64 sample locations per ray. The highest frequency level for the PE was set to L = 10 in the smaller net and L = 20 in the larger net. For both models, I used the Adam optimizer with an initial learning rate of 5e-4 for gradient descent. Similar to Part A, I chose the MSE between the rendered color and the true pixel color as the loss function, and used PSNR to measure the quality of the 3D reconstruction. The MSE/PSNR graphs for the two approaches are as follows.
The PSNR of the first approach reached 23, while the second reached 27.
Figure 9 shows the net's predicted image of the Lego object viewed from behind, after different numbers of training steps.
I used the trained NeRF to predict images from 60 perspectives, and rendered them into a video showing the Lego rotating. They are shown in Figure 10.
It can be observed that the larger MLP net produces clearer pictures, although it took much longer to train.
To change the background color of the predicted images, I modified the volume rendering equation to \[ \begin{align} \hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i + T_{N+1}\mathbf{c}_{back}, \text { where } T_i=\exp \left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \end{align}. \] In this way, when the densities of all sample locations along a ray are low and the ray reaches the background, the rendered color becomes the background color \(\mathbf{c}_{back}\). Figure 11 shows the rendered rotation videos with a red, green, and blue background, respectively.
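In code, this only changes the final accumulation step: the transmittance left over after the last sample is multiplied by the background color. A sketch following the earlier rendering helper (`c_back` is assumed to be a length-3 RGB tensor):

```python
import torch

def volume_render_with_background(sigmas, rgbs, delta, c_back):
    """Volume rendering where rays that are not absorbed pick up the background color."""
    alphas = 1.0 - torch.exp(-sigmas * delta)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    T = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = T * alphas
    color = (weights[..., None] * rgbs).sum(dim=-2)
    return color + trans[:, -1:] * c_back        # T_{N+1} * c_back for unabsorbed rays
```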
* Finished on Dec 9, 2024.