CS180-Proj4: Stitching Photo Mosaics

David Wei
david_wei@berkeley.edu

1. Introduction

Project 4 consists of two parts: Image Warping and Mosaicing (Part A) and Feature Matching for Autostitching (Part B).

In Part A, I rectified images using Perspective Transform. I manually selected correspondences on the images and warped them so that the transformed correspondences form a rectangle. I also produced mosaic images by blending pairs of overlapping images. First, I manually matched a few pixels that represent the same corner of an object in the two images. Then, I treated these pixel matches as correspondences and warped the first image so that, after warping, its correspondences align with those on the second image. In this way, the same objects in the two images line up. Finally, I applied Alpha Blending to the output mosaic to erase the seam between the two images.

In Part B, I also produced mosaic images, only this time the pixel matches are detected and selected automatically rather than manually. Corners make good landmarks of objects in an image, so I used the Harris Corner Detector to find corners and treated them as interest points. Then, I used Adaptive Non-Maximal Suppression (ANMS) to select interest points that are not only high in "corner strength" but also distributed as uniformly as possible across the image; these are the potential correspondences. Next, I matched the potential correspondences using Feature Descriptors; if the best match of a potential correspondence did not score significantly higher than its second-best match, I discarded that pixel. Since the matched pixels may still contain errors, I found an optimal set of matches using Random Sample Consensus (RANSAC). Finally, as in Part A, I used the optimal matches to perform a perspective transform on the first image so that it aligns with the second image, and blended the overlapping region to erase the seam.

A1. Image Rectification

Image rectification is a transformation that reprojects an image onto a plane facing a rectangular object in the scene head-on. To do this, I manually selected the corners of a rectangular object in the image as the correspondences, then applied a Perspective Transform so that after the transform these correspondences form a rectangle. The Perspective Transform is written as \[ \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \] where (x, y) is a pixel's coordinate before the transform, (x', y') is its coordinate after the transform, and the middle 3×3 matrix is the transformation matrix we wish to obtain.

The transformation matrix can be calculated by solving \[ \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -yx' \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix}, \] where (x, y) and (x', y') are the coordinates of a correspondence before and after the transform, respectively. Stacking these two rows for each correspondence, the system has a unique solution when exactly 4 correspondences are used. When more than 4 correspondences exist, I compute an approximate solution: I solve the system using the first four correspondences, then apply Gradient Descent to adjust the transformation matrix according to the remaining correspondences. The loss function is the mean squared error between the correspondences' desired post-transform coordinates and their actual coordinates under the current matrix.
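For reference, here is a minimal NumPy sketch of this estimation. It stacks the two rows above for every correspondence and solves the system by ordinary least squares, which is a common alternative to the gradient-descent refinement described here; with exactly four correspondences it yields the same unique solution. The function name and array layout are illustrative only.

    import numpy as np

    def compute_homography(src, dst):
        """src, dst: (N, 2) arrays of (x, y) points before/after the transform, N >= 4."""
        A, b = [], []
        for (x, y), (xp, yp) in zip(src, dst):
            A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
            A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
            b.extend([xp, yp])
        # Least-squares solution for [a, b, c, d, e, f, g, h]
        h = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)[0]
        return np.append(h, 1.0).reshape(3, 3)  # the bottom-right entry is fixed to 1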

Figure 1 shows the correspondences (blue dots) for a sign image and a sheet image. The rectified images produced from these correspondences are shown in Figure 2.

Figure 1a: Sign image correspondences.
Figure 1b: Sheet image correspondences.
Figure 2a: Rectified sign image.
Figure 2b: Rectified sheet image.

A2. Image Mosaic and Blending

An Image Mosaic is a compound image created by stitching together a series of adjacent photographs, where each adjacent pair shares an overlapping region showing the same objects. In this project, I created mosaics from two overlapping images. I manually matched a number of pixels (no fewer than 4) in the two images' overlapping region; Figure 3 shows the matching pixels on two Sather Tower images I photographed. I used these pixel matches as correspondences and warped the first image with a perspective transform so that its post-transform correspondences align with the correspondences on the second image. In this way, the same objects in the overlapping region line up. The Sather Tower mosaic is shown in Figure 4a.
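Below is a minimal sketch of this warping step, assuming the correspondences pts1 and pts2 are already loaded as (N, 2) arrays of (x, y) pixel coordinates and reusing compute_homography from the A1 sketch. It uses scikit-image's ProjectiveTransform and warp as a stand-in for my own inverse-warping code; the file names and canvas size are illustrative.

    import numpy as np
    from skimage import io, transform

    im1 = io.imread("tower_left.jpg") / 255.0     # hypothetical file names
    im2 = io.imread("tower_right.jpg") / 255.0

    H = compute_homography(pts1, pts2)            # maps image 1 coords into image 2's frame

    # warp() pulls every output pixel back through the inverse mapping, so the
    # forward matrix is wrapped in a ProjectiveTransform and its inverse is passed in.
    # In practice a translation is also composed into H so the warped image fits the canvas.
    tf = transform.ProjectiveTransform(matrix=H)
    canvas_shape = (im2.shape[0], im1.shape[1] + im2.shape[1])   # illustrative canvas size
    warped1 = transform.warp(im1, tf.inverse, output_shape=canvas_shape)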

Figure 3: Matching pixels on two Sather Tower images.

Modern cameras (especially phone cameras) automatically adjust image brightness, so it is almost impossible for the same point to have exactly the same color in two different shots. To erase the resulting seam between the two images, I used Alpha Compositing to blend the overlapping region. The alpha value for each pixel is \[ \alpha = \frac{d_1}{d_1 + d_2}, \] where d1 is the pixel's Euclidean distance to the nearest point outside image 1 (i.e., its distance to image 1's boundary), and d2 is the corresponding distance for image 2. These distances are computed quickly with the Distance Transform algorithm. The mosaic after blending is shown in Figure 4b.
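A minimal sketch of this blend, assuming warped1 and placed2 are the two images already placed on a common canvas and mask1 and mask2 are boolean arrays marking their valid pixels (all four names are illustrative):

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    d1 = distance_transform_edt(mask1)   # distance to the nearest pixel outside image 1
    d2 = distance_transform_edt(mask2)   # distance to the nearest pixel outside image 2

    denom = np.maximum(d1 + d2, 1e-8)    # avoid dividing by zero outside both images
    alpha = d1 / denom                   # alpha = d1 / (d1 + d2)

    mosaic = alpha[..., None] * warped1 + (1.0 - alpha[..., None]) * placed2

Outside the overlap the weights fall back to 1 or 0, so the non-overlapping parts of each image are copied through unchanged.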

Figure 4a: Sather Tower mosaic image before blending.
Figure 4b: Sather Tower mosaic image after blending.

I also produced mosaics of a crossroad on my way to school and of my messy room, shown in Figures 5 and 6, respectively.

Figure 5a: Crossroad left.
Figure 5b: Crossroad right.
Figure 5c: Crossroad mosaic image.
Figure 6a: My room left.
Figure 6b: My room right.
Figure 6c: My room mosaic image.

B1. Harris Corner Detector

In Part B, instead of manually matching pixels across the two images, I rely on the computer to find the matches. Corners make good landmarks of objects in an image, since the neighborhood around a corner looks nearly identical in both photographs. I used the Harris Corner Detector to compute the corner strength of each pixel; the higher a pixel's corner strength, the more likely it represents a corner in the image. The corner strength of each pixel in an image of Moffitt Library (Figure 7a) is shown in Figure 7b.
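A minimal sketch of this step, using scikit-image's Harris implementation as a stand-in for the detector I used; the file name and sigma are illustrative.

    import numpy as np
    from skimage import color, feature, io

    img = color.rgb2gray(io.imread("moffitt_left.jpg"))        # hypothetical file name

    strength = feature.corner_harris(img, sigma=1.0)            # per-pixel corner strength (Figure 7b)
    interest = feature.corner_peaks(strength, min_distance=1)   # (row, col) of local maxima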

Figure 7a: Moffitt Library image.
Figure 7b: Per-pixel corner strength.

B2. Adaptive Non-Maximal Suppression (ANMS)

Pixels with higher corner strength are more likely to have neighborhoods that look nearly identical across the two images, which boosts matching accuracy. However, a good set of correspondences for a Perspective Transform should also cover as much of the image as possible, and a clustered set of correspondences often leads to an unsuccessful mosaic. To find pixels that are both strong corners and well spread out, I used Adaptive Non-Maximal Suppression (ANMS): I sorted the pixels by corner strength in decreasing order and selected potential correspondences from the front of the sorted list, keeping a pixel only if it is no closer than 18 pixels to any pixel already selected. The 5000 pixels with the highest corner strength (interest points) are shown as blue dots in Figure 8, along with the selected interest points (potential correspondences) shown as red dots.
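A sketch of this selection loop; coords and strengths come from the corner detector, the 18-pixel radius is the one used above, and n_keep is illustrative since the report does not fix the number of points kept.

    import numpy as np

    def select_spread_corners(coords, strengths, n_keep=500, min_dist=18):
        """coords: (N, 2) array of (row, col); strengths: (N,) corner strengths."""
        order = np.argsort(strengths)[::-1]               # strongest corners first
        kept = []
        for idx in order:
            # keep the point only if every already-kept point is at least min_dist away
            if all(np.hypot(*(coords[idx] - coords[j])) >= min_dist for j in kept):
                kept.append(idx)
                if len(kept) == n_keep:
                    break
        return coords[kept]

    # e.g. with the Harris output from B1:
    # picks = select_spread_corners(interest, strength[interest[:, 0], interest[:, 1]])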

Figure 8: Interest points (blue dots) and potential correspondences (red dots) in the Moffitt Library image.

B3. Feature Descriptor and Pixel Matching

With potential correspondences selected on both adjacent images, it is time to match them. Within each match, the two correspondences should point to the same corner of the same object (the same corner of a building, the same treetop, etc.). Research suggests that when the neighborhood of the same part of an object is blurred, the blurred patch stays similar even if the viewing angle rotates within a small range. Therefore, for each potential correspondence I sampled pixels from a patch of radius r = 20 (size 40 × 40) with spacing s = 5, giving an 8 × 8 Feature Descriptor. The sampled pixels are normalized to a mean of 0 and a standard deviation of 1. Two correspondences whose Feature Descriptors achieve the highest inner product are matched.
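A sketch of the descriptor extraction, assuming img is the grayscale image and points holds (row, col) interest points lying at least r pixels from the border; the pre-blur sigma is my own choice and is not specified above.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def describe(img, points, r=20, s=5):
        """Return one flattened 8x8 descriptor per (row, col) point."""
        blurred = gaussian_filter(img, sigma=s / 2)      # blur before subsampling (sigma illustrative)
        descs = []
        for row, col in points:
            patch = blurred[row - r:row + r:s, col - r:col + r:s]   # 40x40 window sampled every 5 px -> 8x8
            patch = (patch - patch.mean()) / (patch.std() + 1e-8)   # normalize to mean 0, std 1
            descs.append(patch.ravel())
        return np.array(descs)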

When an image contains similar objects (e.g. similar windows on a building, or leaves on a tree), it is easy to match the wrong corner. So I discarded every pixel whose best match did not score significantly higher than its second-best match, i.e. whose second-best inner product was not at least 40% smaller than its best. The potential correspondences (pink dots) and the selected pixel matches (red dots and lines) in the two adjacent Moffitt Library images are shown in Figure 9.
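A sketch of the matching and rejection rule, using descriptors from the previous sketch; desc1 and desc2 are the descriptor arrays for the two images, and the 0.6 factor encodes the 40% cutoff above.

    import numpy as np

    def match_features(desc1, desc2, ratio=0.6):
        """Return (i, j) index pairs whose best match passes the cutoff."""
        scores = desc1 @ desc2.T                      # pairwise inner products
        matches = []
        for i, row in enumerate(scores):
            best, second = np.argsort(row)[::-1][:2]  # indices of the two highest scores
            # keep the match only if the runner-up is at least 40% below the best
            if row[second] <= ratio * row[best]:
                matches.append((i, best))
        return matches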

Figure 9: Potential correspondences (pink dots) and selected matches (red dots and lines) in the two adjacent Moffitt Library images.

B4. Random Sample Consensus (RANSAC)

The matched pixels may still contain errors. I used Random Sample Consensus (RANSAC) to find the optimal set of matches for the perspective transform. In each of 1000 iterations, I computed a transformation matrix from 4 randomly selected pixel matches and recorded the number of outliers for that matrix. An outlier is any matched correspondence whose transformed position does not align with its partner on the second image; here, a distance larger than 3 pixels counts as not aligned. The transformation matrix with the fewest outliers is taken as the optimal transformation matrix.
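A sketch of this loop, reusing compute_homography from the A1 sketch; pts1 and pts2 are the (N, 2) matched (x, y) coordinates from B3, and counting inliers is equivalent to counting outliers here.

    import numpy as np

    def ransac_homography(pts1, pts2, n_iters=1000, thresh=3.0, seed=0):
        """Return the transformation matrix with the most inliers over n_iters random trials."""
        rng = np.random.default_rng(seed)
        ones = np.ones((len(pts1), 1))
        best_H, best_inliers = None, -1
        for _ in range(n_iters):
            sample = rng.choice(len(pts1), size=4, replace=False)
            H = compute_homography(pts1[sample], pts2[sample])   # exact fit to 4 matches
            proj = np.hstack([pts1, ones]) @ H.T                 # [wx', wy', w] for every match
            proj = proj[:, :2] / proj[:, 2:3]                    # divide by w
            inliers = np.sum(np.linalg.norm(proj - pts2, axis=1) < thresh)
            if inliers > best_inliers:
                best_H, best_inliers = H, inliers
        return best_H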

B5. Image Mosaic and Blending

Using the same technique introduced in Part A, I produced mosaics of Moffitt Library (Figure 10), the Bakar BioEnginuity Hub (Figure 11), and the FU on my roommate's door (Figure 12). I also generated mosaics from the same source images using manually selected matches and present them alongside the fully automatically generated mosaics. In my opinion, the mosaics built from computer-generated matches are more accurately aligned.

Figure 10a: Moffitt Library left.
Figure 10b: Moffitt Library right.
Figure 10c: Moffitt Library mosaic based on manually selected matches.
Figure 10d: Moffitt Library mosaic based on computer-generated matches.
Figure 11a: Bakar BioEnginuity Hub left.
Figure 11b: Bakar BioEnginuity Hub right.
Figure 11c: Bakar BioEnginuity Hub mosaic based on manually selected matches.
Figure 11d: Bakar BioEnginuity Hub mosaic based on computer-generated matches.
Figure 12a: FU left.
Figure 12b: FU right.
Figure 12c: FU mosaic based on manually selected matches.
Figure 12d: FU mosaic based on computer-generated matches.

* Part A finished on Oct 19, 2024; Part B finished on Oct 24, 2024.