SIFT is an algorithm from the paper Distinctive Image Features from Scale-Invariant Keypoints (Lowe, 2004). The work later found one of its best applications in image stitching, as proposed in Automatic Panoramic Image Stitching using Invariant Features (Brown & Lowe, 2007). Assignment 1 of EE5731 Visual Computing is based on these two papers.
It's the third feature introduced in the EE5731 module, after Haar-like features and HOG. Although "Scale-Invariant" is highlighted in the title, SIFT features are far more powerful than that. They are also resistant to image rotation and changes in lighting, and they even hold up well when the camera's 3D viewpoint changes a bit.

Overview
Following are the major stages of computation used to generate the set of image features.
Scale-space extrema detection
The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation.
Keypoint localization
At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.
Orientation assignment
One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations.
Keypoint descriptor
The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.
Scale-space extrema: Difference of Gaussian
Input: an image
Output: the DoG images and the locations of the scale-space extrema in each DoG
When selecting anchor points from an image, we prefer stable ones, usually extrema or corners. It's easy to find extrema that have larger pixel values than their neighbors, but doing so also picks up noise pixels and pixels on edges, making the features unstable. "Stable" means the feature points can be repeatably assigned under different views of the same object. We also hope the same object gives close feature descriptors, to simplify matching; the design of the descriptor is introduced later.
Scale-space extrema are the extrema coming from the difference-of-Gaussian (DoG) pyramid. Each level of the pyramid is called an octave, formed from images filtered with Gaussian kernels of increasing scale. Subtracting adjacent Gaussian-filtered images within an octave creates the difference-of-Gaussian images. Then we down-sample the image by a factor of 2 and repeat the process.
For an input image $I(x, y)$, the scale space at a particular scale $\sigma$, $L(x, y, \sigma)$, is the convolution of a variable-scale Gaussian $G(x, y, \sigma)$ with the image:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

The Gaussian blur in two dimensions is the product of two one-dimensional Gaussian functions:

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}$$

Note that the formula of a Gaussian function in one dimension is:

$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/2\sigma^2}$$

The difference-of-Gaussian is the difference between two nearby scales separated by a constant factor $k$:

$$D(x, y, \sigma) = \bigl(G(x, y, k\sigma) - G(x, y, \sigma)\bigr) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$$
The DoG function is a close approximation to the scale-normalized Laplacian of Gaussian (LoG), $\sigma^2 \nabla^2 G$; in fact $D \approx (k - 1)\,\sigma^2 \nabla^2 G$. It has been shown that the extrema of the LoG produce the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner function.
The maxima and minima of the difference-of-Gaussian images are then detected by comparing each pixel to its 26 neighbors: 8 neighbors in the current image and 9 neighbors in each of the scales above and below.
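Here is a minimal sketch of this stage in Python with NumPy and SciPy. The function names and the parameter defaults (`num_scales`, base `sigma`, step `k`) are illustrative choices, not Lowe's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, num_scales=5, sigma=1.6, k=np.sqrt(2)):
    """One octave: blur with geometrically increasing sigma, subtract neighbors."""
    image = np.asarray(image, dtype=float)
    gaussians = [gaussian_filter(image, sigma * k ** i) for i in range(num_scales)]
    return [b - a for a, b in zip(gaussians, gaussians[1:])]

def find_extrema(dogs):
    """Locations whose value is the max or min of the 26 neighbors across scales."""
    keypoints = []
    for s in range(1, len(dogs) - 1):
        stack = np.stack(dogs[s - 1:s + 2])  # (3, H, W): scale below, current, above
        h, w = dogs[s].shape
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                cube = stack[:, y - 1:y + 2, x - 1:x + 2]  # 3x3x3 neighborhood
                v = dogs[s][y, x]
                if v == cube.max() or v == cube.min():
                    keypoints.append((x, y, s))
    return keypoints
```

The brute-force scan is written for clarity, not speed; after one octave is processed, the image is down-sampled by a factor of 2 and the same procedure repeats.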

Keypoint localization: Taylor expansion
Input: the locations of the scale-space extrema from a DoG
Output: the refined, interpolated locations of the extrema
Simply using the discrete locations and pixel values of the extrema from the previous stage would not make the algorithm invalid. However, noise pixels also produce high responses in the DoG and get detected as keypoints.
The Taylor expansion of $D(x, y, \sigma)$, up to the quadratic term, is used to find the location of the true extremum $\hat{\mathbf{x}}$ (or rather $\mathbf{x} + \hat{\mathbf{x}}$, where $\hat{\mathbf{x}}$ is the offset).

Let $\mathbf{x} = (x, y, \sigma)^T$ be the location of the keypoint candidate in the DoG at scale $\sigma$; we have:

$$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^T \mathbf{x} + \frac{1}{2}\,\mathbf{x}^T \frac{\partial^2 D}{\partial \mathbf{x}^2}\,\mathbf{x}$$

See Wikipedia for more help on the vector and matrix products. By setting $\frac{\partial D(\mathbf{x})}{\partial \mathbf{x}} = 0$, we have the extremum:

$$\hat{\mathbf{x}} = -\left(\frac{\partial^2 D}{\partial \mathbf{x}^2}\right)^{-1} \frac{\partial D}{\partial \mathbf{x}}$$
Since the computation only involves the 3×3×3 block around the keypoint candidate $\mathbf{x}$, we can copy the block and set its center as the origin, so $\mathbf{x} = (0, 0, 0)^T$. Then $\hat{\mathbf{x}}$ becomes the offset from the center.
If $\hat{\mathbf{x}}$ is larger than 0.5 in any dimension, the actual extremum lies closer to a neighboring sample point than to $\mathbf{x}$. In that case, we set that neighbor as the new center, fetch a new 3×3×3 block around it, and repeat the calculation until every dimension of $\hat{\mathbf{x}}$ is no larger than 0.5. With the final extremum, we update the keypoint with its refined location.
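A rough sketch of this refinement, assuming `cube` is the 3×3×3 block of DoG values (axes ordered scale, y, x) centered on the candidate; the central-difference scheme is standard, but the function name is mine:

```python
import numpy as np

def refine_extremum(cube):
    """Offset x-hat of the interpolated extremum from the center of a 3x3x3 block."""
    # First derivatives by central differences at the center (1, 1, 1).
    grad = 0.5 * np.array([
        cube[2, 1, 1] - cube[0, 1, 1],   # d/dsigma
        cube[1, 2, 1] - cube[1, 0, 1],   # d/dy
        cube[1, 1, 2] - cube[1, 1, 0],   # d/dx
    ])
    # Second derivatives (Hessian) by finite differences.
    c = cube[1, 1, 1]
    dss = cube[2, 1, 1] + cube[0, 1, 1] - 2 * c
    dyy = cube[1, 2, 1] + cube[1, 0, 1] - 2 * c
    dxx = cube[1, 1, 2] + cube[1, 1, 0] - 2 * c
    dsy = 0.25 * (cube[2, 2, 1] - cube[2, 0, 1] - cube[0, 2, 1] + cube[0, 0, 1])
    dsx = 0.25 * (cube[2, 1, 2] - cube[2, 1, 0] - cube[0, 1, 2] + cube[0, 1, 0])
    dyx = 0.25 * (cube[1, 2, 2] - cube[1, 2, 0] - cube[1, 0, 2] + cube[1, 0, 0])
    hessian = np.array([[dss, dsy, dsx],
                        [dsy, dyy, dyx],
                        [dsx, dyx, dxx]])
    # Solve H @ offset = -grad, i.e. x-hat = -H^{-1} grad.
    return -np.linalg.solve(hessian, grad)  # repeat with a shifted block if any |component| > 0.5
```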
Nitty-gritty Ahead
Thresholding is used multiple times in SIFT. By thresholding we can tell whether a keypoint is just noise; we can even tell whether it lies on a corner, which gives more confidence that it carries rich geometric information.
Contrast threshold
Input: the refined location of a keypoint
Output: the contrast of the keypoint and a decision on whether it is a noise pixel or not
The extremum location has another use: noise rejection. Most noise responses are not that strong. The interpolated amplitude of the keypoint on the DoG is

$$D(\hat{\mathbf{x}}) = D + \frac{1}{2}\frac{\partial D}{\partial \mathbf{x}}^T \hat{\mathbf{x}}$$

If $|D(\hat{\mathbf{x}})|$ is less than 0.03 (the threshold used by Lowe), the keypoint is dropped as a noise pixel.
Note that the image is normalized to $[0, 1]$ from $[0, 255]$.
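Continuing the localization sketch above, the interpolated value $D(\hat{\mathbf{x}})$ only needs the gradient and the offset (the 0.03 threshold assumes pixel values normalized to $[0, 1]$):

```python
import numpy as np

def contrast_at_extremum(cube, offset):
    """Interpolated DoG value D(x-hat) at the refined extremum."""
    grad = 0.5 * np.array([
        cube[2, 1, 1] - cube[0, 1, 1],
        cube[1, 2, 1] - cube[1, 0, 1],
        cube[1, 1, 2] - cube[1, 1, 0],
    ])
    return cube[1, 1, 1] + 0.5 * grad.dot(offset)

# Keep the keypoint only if its contrast clears the threshold:
# is_noise = abs(contrast_at_extremum(cube, offset)) < 0.03
```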
Edge threshold
Input: the refined location of a keypoint
Output: whether it lies on an edge or a corner
Using the 3×3 block around the keypoint, we can also compute a 2×2 Hessian matrix:

$$\mathbf{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$$

Its trace and determinant can be expressed as the sum and the product of its two eigenvalues $\alpha$ and $\beta$ (say, $\alpha \ge \beta$). The eigenvalues are proportional to the principal curvatures of $D$: $\alpha$ corresponds to the maximum curvature at the point and $\beta$ to the minimum.

Let $r = \alpha / \beta$, then:

$$\frac{\mathrm{Tr}(\mathbf{H})^2}{\mathrm{Det}(\mathbf{H})} = \frac{(\alpha + \beta)^2}{\alpha \beta} = \frac{(r\beta + \beta)^2}{r\beta^2} = \frac{(r + 1)^2}{r}$$

The empirical threshold is $r = 10$. If

$$\frac{\mathrm{Tr}(\mathbf{H})^2}{\mathrm{Det}(\mathbf{H})} > \frac{(r + 1)^2}{r},$$

then it is more likely that the keypoint lies on an edge.
Consider the image as a 3D surface, where the height of a point is its pixel value $D(x, y)$. If the point lies on an edge, it sits on a ridge or in a valley, making $\alpha \gg \beta$.
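A sketch of this test in Python, assuming `dog` is one DoG image and `(x, y)` is the keypoint location; the function name and finite-difference scheme are mine, while $r = 10$ follows Lowe:

```python
def passes_edge_test(dog, x, y, r=10.0):
    """Reject keypoints on edges via the ratio of principal curvatures."""
    # 2x2 Hessian in the image plane by finite differences.
    dxx = dog[y, x + 1] + dog[y, x - 1] - 2 * dog[y, x]
    dyy = dog[y + 1, x] + dog[y - 1, x] - 2 * dog[y, x]
    dxy = 0.25 * (dog[y + 1, x + 1] - dog[y + 1, x - 1]
                  - dog[y - 1, x + 1] + dog[y - 1, x - 1])
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:  # curvatures of different signs: not a stable extremum
        return False
    # Tr^2/Det grows with the curvature ratio; keep only well-rounded peaks.
    return trace * trace / det < (r + 1) ** 2 / r
```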

Orientation assignment
Input: the image location and scale of a keypoint
Output: an orientation histogram
SIFT deals with image rotation by analyzing the local image gradients around each keypoint. The gradient magnitudes and orientations in a region around the keypoint are accumulated into a 36-bin histogram, weighted by magnitude and by a Gaussian window; the highest peak assigns a consistent orientation to the keypoint, which becomes our reference. The keypoint descriptor is then built relative to this orientation, making it invariant to image rotation.
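Here is a minimal sketch of the orientation histogram, assuming `gaussian_img` is the Gaussian-blurred image at the keypoint's scale. The `radius` and the Gaussian weighting width are illustrative choices, and Lowe's refinement of also keeping secondary peaks above 80% of the maximum (as extra keypoints) is omitted:

```python
import numpy as np

def assign_orientation(gaussian_img, x, y, radius=8, num_bins=36):
    """Dominant gradient orientation around a keypoint (36-bin histogram)."""
    hist = np.zeros(num_bins)
    h, w = gaussian_img.shape
    for j in range(max(1, y - radius), min(h - 1, y + radius + 1)):
        for i in range(max(1, x - radius), min(w - 1, x + radius + 1)):
            dx = gaussian_img[j, i + 1] - gaussian_img[j, i - 1]
            dy = gaussian_img[j + 1, i] - gaussian_img[j - 1, i]
            magnitude = np.hypot(dx, dy)
            theta = np.degrees(np.arctan2(dy, dx)) % 360
            # Weight by a Gaussian window centered on the keypoint.
            weight = np.exp(-((i - x) ** 2 + (j - y) ** 2) / (2 * (0.5 * radius) ** 2))
            hist[int(theta // (360 / num_bins)) % num_bins] += weight * magnitude
    return np.argmax(hist) * (360 / num_bins)  # dominant orientation in degrees
```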
Keypoint descriptor
Input: the refined location of a keypoint
Output: a 4×4×8 = 128-element feature vector
The descriptor is built from the image gradients in a 16×16 region around the keypoint, rotated to its assigned orientation: the region is divided into a 4×4 grid of cells, and each cell accumulates an 8-bin gradient orientation histogram, giving 4 × 4 × 8 = 128 values. The vector is then normalized to reduce the effect of illumination changes.
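To make the 128 dimensions concrete, here is a heavily simplified sketch of the histogram layout. It skips the rotation to the assigned orientation, the Gaussian weighting, and the trilinear interpolation used by the real descriptor, and it assumes the keypoint sits away from the image border:

```python
import numpy as np

def descriptor(gaussian_img, x, y, num_cells=4, cell_size=4, num_bins=8):
    """Simplified 4x4x8 descriptor: per-cell gradient orientation histograms."""
    half = num_cells * cell_size // 2                  # 16x16 region -> half = 8
    hist = np.zeros((num_cells, num_cells, num_bins))
    for j in range(-half, half):
        for i in range(-half, half):
            yy, xx = y + j, x + i
            dx = gaussian_img[yy, xx + 1] - gaussian_img[yy, xx - 1]
            dy = gaussian_img[yy + 1, xx] - gaussian_img[yy - 1, xx]
            magnitude = np.hypot(dx, dy)
            theta = np.degrees(np.arctan2(dy, dx)) % 360
            cell_r, cell_c = (j + half) // cell_size, (i + half) // cell_size
            hist[cell_r, cell_c, int(theta // (360 / num_bins)) % num_bins] += magnitude
    vec = hist.flatten()                               # 4 * 4 * 8 = 128 values
    vec /= (np.linalg.norm(vec) + 1e-7)                # normalize against illumination
    vec = np.minimum(vec, 0.2)                         # clamp large gradients (Lowe's 0.2 cap)
    return vec / (np.linalg.norm(vec) + 1e-7)          # renormalize
```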
In this part, I used SIFT functions from both siftDemoV4 and VLFeat to generate the keypoints and descriptors, then drew them over the original image. siftDemoV4 provides us with the functions `sift` and `showkeys`.

The result generated by VLFeat is relatively cleaner. It can randomly select n keypoints for descriptor visualization; in this case n = 50:
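For readers without MATLAB, here is a roughly equivalent pipeline in Python with OpenCV. This is not what I used for the assignment, and `SIFT_create` requires opencv-python 4.4 or later:

```python
import cv2
import numpy as np

# Detect keypoints and compute 128-element descriptors.
image = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
print(descriptors.shape)  # (num_keypoints, 128)

# Draw a random subset of 50 keypoints, similar to the VLFeat visualization.
subset = np.random.choice(len(keypoints), size=min(50, len(keypoints)), replace=False)
vis = cv2.drawKeypoints(image, [keypoints[i] for i in subset], None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("keypoints.jpg", vis)
```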

What's next
In the next post, two more techniques will join the workflow of SIFT to achieve image stitching across multiple images. Go check that out!