SIFT is an algorithm from the paper Distinctive Image Features from Scale-Invariant Keypoints (Lowe, 2004). The work later found one of its best applications in image stitching, as proposed in Automatic Panoramic Image Stitching using Invariant Features (Brown & Lowe, 2007). Assignment 1 of EE5731 Visual Computing is based on these two papers.
It's the third feature introduced in the EE5731 module, after Haar-like features and HOG. Although "Scale-Invariant" is highlighted in the title, SIFT features are far more powerful than that. They are also resistant to image rotation and changes in lighting, and they even hold up well when the camera's 3D viewpoint changes a bit.

Overview
Following are the major stages of computation used to generate the set of image features.
Scale-space extrema detection
The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation.
Keypoint localization
At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.
Orientation assignment
One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations.
Keypoint descriptor
The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.
Scale-space extrema: Difference of Gaussian
Input: an image
Output: the DoG images and the locations of the scale-space extrema in each DoG
When selecting anchor points from an image, we prefer stable ones, usually extrema or corners. It's easy to find extrema that have larger pixel values than their neighbors, but doing so also picks up noise pixels and pixels on edges, making the features unstable. "Stable" means the feature points can be repeatably assigned under different views of the same object. We also hope the same object gives close feature descriptors, to simplify matching; the design of the descriptor is introduced later.
Scale-space extrema are the extrema coming from the difference-of-Gaussian (DoG) pyramid. Each level of the pyramid is called an octave, formed from images filtered with Gaussian kernels of increasing scale. Subtracting adjacent Gaussian-filtered images within an octave creates the difference-of-Gaussian images. Then we down-sample the image by a factor of 2 and repeat the process.
For an input image $I(x, y)$, the scale space at a particular scale $\sigma$, $L(x, y, \sigma)$, is the convolution of a variable-scale Gaussian $G(x, y, \sigma)$ with the image:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

The Gaussian blur in two dimensions is the product of two one-dimensional Gaussian functions:

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}$$

Note that the formula of a Gaussian function in one dimension is:

$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/2\sigma^2}$$

The difference-of-Gaussian is the difference between two nearby scales separated by a constant factor $k$:

$$D(x, y, \sigma) = \bigl(G(x, y, k\sigma) - G(x, y, \sigma)\bigr) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$$
The DoG function is a close approximation to the scale-normalized Laplacian of Gaussian (LoG), $\sigma^2 \nabla^2 G$; in fact $D \approx (k - 1)\,\sigma^2 \nabla^2 G$. It has been shown that the extrema of the LoG produce the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner function.
The maxima and minima of the difference-of-Gaussian images are then detected by comparing each pixel to its 26 neighbors: 8 neighbors in the current image and 9 neighbors in each of the scales above and below.
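Here is a minimal sketch of this stage in Python with NumPy and SciPy. The function names and the parameter defaults (`num_scales`, base `sigma`, step `k`) are illustrative choices, not Lowe's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, num_scales=5, sigma=1.6, k=np.sqrt(2)):
    """One octave: blur with geometrically increasing sigma, subtract neighbors."""
    image = np.asarray(image, dtype=float)
    gaussians = [gaussian_filter(image, sigma * k ** i) for i in range(num_scales)]
    return [b - a for a, b in zip(gaussians, gaussians[1:])]

def find_extrema(dogs):
    """Locations whose value is the max or min of the 26 neighbors across scales."""
    keypoints = []
    for s in range(1, len(dogs) - 1):
        stack = np.stack(dogs[s - 1:s + 2])  # (3, H, W): scale below, current, above
        h, w = dogs[s].shape
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                cube = stack[:, y - 1:y + 2, x - 1:x + 2]  # 3x3x3 neighborhood
                v = dogs[s][y, x]
                if v == cube.max() or v == cube.min():
                    keypoints.append((x, y, s))
    return keypoints
```

The brute-force scan is written for clarity, not speed; after one octave is processed, the image is down-sampled by a factor of 2 and the same procedure repeats.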

Keypoint localization: Taylor expansion
Input: the locations of the scale-space extrema from a DoG
Output: the refined, interpolated locations of the extrema
Simply using the discrete locations and pixel values of the extrema from the previous stage would not make the algorithm invalid. However, noise pixels also produce high responses in the DoG and get detected as keypoints.
The Taylor expansion of $D(x, y, \sigma)$, up to the quadratic term, is used to find the location of the true extremum $\hat{\mathbf{x}}$ (or rather $\mathbf{x} + \hat{\mathbf{x}}$, where $\hat{\mathbf{x}}$ is the offset).

Let $\mathbf{x} = (x, y, \sigma)^T$ be the location of the keypoint candidate in the DoG at scale $\sigma$; we have:

$$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^T \mathbf{x} + \frac{1}{2}\,\mathbf{x}^T \frac{\partial^2 D}{\partial \mathbf{x}^2}\,\mathbf{x}$$

See Wikipedia for more help on the vector and matrix products. By setting $\frac{\partial D(\mathbf{x})}{\partial \mathbf{x}} = 0$, we have the extremum:

$$\hat{\mathbf{x}} = -\left(\frac{\partial^2 D}{\partial \mathbf{x}^2}\right)^{-1} \frac{\partial D}{\partial \mathbf{x}}$$
Since the computation only involves the 3×3×3 block around the keypoint candidate $\mathbf{x}$, we can copy the block and set its center as the origin, so $\mathbf{x} = (0, 0, 0)^T$. Then $\hat{\mathbf{x}}$ becomes the offset from the center.
If $\hat{\mathbf{x}}$ is larger than 0.5 in any dimension, the actual extremum lies closer to a neighboring sample point than to $\mathbf{x}$. In that case, we set that neighbor as the new center, fetch a new 3×3×3 block around it, and repeat the calculation until every dimension of $\hat{\mathbf{x}}$ is no larger than 0.5. With the final extremum, we update the keypoint with its refined location.
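A rough sketch of this refinement, assuming `cube` is the 3×3×3 block of DoG values (axes ordered scale, y, x) centered on the candidate; the central-difference scheme is standard, but the function name is mine:

```python
import numpy as np

def refine_extremum(cube):
    """Offset x-hat of the interpolated extremum from the center of a 3x3x3 block."""
    # First derivatives by central differences at the center (1, 1, 1).
    grad = 0.5 * np.array([
        cube[2, 1, 1] - cube[0, 1, 1],   # d/dsigma
        cube[1, 2, 1] - cube[1, 0, 1],   # d/dy
        cube[1, 1, 2] - cube[1, 1, 0],   # d/dx
    ])
    # Second derivatives (Hessian) by finite differences.
    c = cube[1, 1, 1]
    dss = cube[2, 1, 1] + cube[0, 1, 1] - 2 * c
    dyy = cube[1, 2, 1] + cube[1, 0, 1] - 2 * c
    dxx = cube[1, 1, 2] + cube[1, 1, 0] - 2 * c
    dsy = 0.25 * (cube[2, 2, 1] - cube[2, 0, 1] - cube[0, 2, 1] + cube[0, 0, 1])
    dsx = 0.25 * (cube[2, 1, 2] - cube[2, 1, 0] - cube[0, 1, 2] + cube[0, 1, 0])
    dyx = 0.25 * (cube[1, 2, 2] - cube[1, 2, 0] - cube[1, 0, 2] + cube[1, 0, 0])
    hessian = np.array([[dss, dsy, dsx],
                        [dsy, dyy, dyx],
                        [dsx, dyx, dxx]])
    # Solve H @ offset = -grad, i.e. x-hat = -H^{-1} grad.
    return -np.linalg.solve(hessian, grad)  # repeat with a shifted block if any |component| > 0.5
```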
Nitty-gritty Ahead
Thresholding is used multiple times in SIFT. By thresholding we can tell whether a keypoint is just noise; we can even tell whether it lies on a corner, which gives more confidence that it carries rich geometric information.
Contrast threshold
Input: the refined location of a keypoint
Output: the contrast of the keypoint and a decision on whether it is a noise pixel or not
The extremum location has another use: noise rejection. Most noise responses are not that strong. The interpolated amplitude of the keypoint on the DoG is

$$D(\hat{\mathbf{x}}) = D + \frac{1}{2}\frac{\partial D}{\partial \mathbf{x}}^T \hat{\mathbf{x}}$$

If $|D(\hat{\mathbf{x}})|$ is less than 0.03 (the threshold used by Lowe), the keypoint is dropped as a noise pixel.
Note that the image is normalized to $[0, 1]$ from $[0, 255]$.
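Continuing the localization sketch above, the interpolated value $D(\hat{\mathbf{x}})$ only needs the gradient and the offset (the 0.03 threshold assumes pixel values normalized to $[0, 1]$):

```python
import numpy as np

def contrast_at_extremum(cube, offset):
    """Interpolated DoG value D(x-hat) at the refined extremum."""
    grad = 0.5 * np.array([
        cube[2, 1, 1] - cube[0, 1, 1],
        cube[1, 2, 1] - cube[1, 0, 1],
        cube[1, 1, 2] - cube[1, 1, 0],
    ])
    return cube[1, 1, 1] + 0.5 * grad.dot(offset)

# Keep the keypoint only if its contrast clears the threshold:
# is_noise = abs(contrast_at_extremum(cube, offset)) < 0.03
```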
Edge threshold
Input: the refined location of a keypoint
Output: whether it lies on an edge or a corner
Using the 3×3 block around the keypoint, we can also compute a 2×2 Hessian matrix:

$$\mathbf{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$$

Its trace and determinant can be expressed as the sum and the product of its two eigenvalues $\alpha$ and $\beta$ (say, $\alpha \ge \beta$). The eigenvalues are proportional to the principal curvatures of $D$: $\alpha$ corresponds to the maximum curvature at the point and $\beta$ to the minimum.

Let $r = \alpha / \beta$, then:

$$\frac{\mathrm{Tr}(\mathbf{H})^2}{\mathrm{Det}(\mathbf{H})} = \frac{(\alpha + \beta)^2}{\alpha \beta} = \frac{(r\beta + \beta)^2}{r\beta^2} = \frac{(r + 1)^2}{r}$$

The empirical threshold is $r = 10$. If

$$\frac{\mathrm{Tr}(\mathbf{H})^2}{\mathrm{Det}(\mathbf{H})} > \frac{(r + 1)^2}{r},$$

then it is more likely that the keypoint lies on an edge.
Consider the image as a 3D surface, where the height of a point is its pixel value $D(x, y)$. If the point lies on an edge, it sits on a ridge or in a valley, making $\alpha \gg \beta$.
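A sketch of this test in Python, assuming `dog` is one DoG image and `(x, y)` is the keypoint location; the function name and finite-difference scheme are mine, while $r = 10$ follows Lowe:

```python
def passes_edge_test(dog, x, y, r=10.0):
    """Reject keypoints on edges via the ratio of principal curvatures."""
    # 2x2 Hessian in the image plane by finite differences.
    dxx = dog[y, x + 1] + dog[y, x - 1] - 2 * dog[y, x]
    dyy = dog[y + 1, x] + dog[y - 1, x] - 2 * dog[y, x]
    dxy = 0.25 * (dog[y + 1, x + 1] - dog[y + 1, x - 1]
                  - dog[y - 1, x + 1] + dog[y - 1, x - 1])
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:  # curvatures of different signs: not a stable extremum
        return False
    # Tr^2/Det grows with the curvature ratio; keep only well-rounded peaks.
    return trace * trace / det < (r + 1) ** 2 / r
```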

Orientation assignment
Input: the image location and scale of a keypoint
Output: an orientation histogram
SIFT deals with image rotation by analyzing the local image gradients around each keypoint. The gradient magnitudes and orientations in a region around the keypoint are accumulated into a 36-bin histogram, weighted by magnitude and by a Gaussian window; the highest peak assigns a consistent orientation to the keypoint, which becomes our reference. The keypoint descriptor is then built relative to this orientation, making it invariant to image rotation.
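Here is a minimal sketch of the orientation histogram, assuming `gaussian_img` is the Gaussian-blurred image at the keypoint's scale. The `radius` and the Gaussian weighting width are illustrative choices, and Lowe's refinement of also keeping secondary peaks above 80% of the maximum (as extra keypoints) is omitted:

```python
import numpy as np

def assign_orientation(gaussian_img, x, y, radius=8, num_bins=36):
    """Dominant gradient orientation around a keypoint (36-bin histogram)."""
    hist = np.zeros(num_bins)
    h, w = gaussian_img.shape
    for j in range(max(1, y - radius), min(h - 1, y + radius + 1)):
        for i in range(max(1, x - radius), min(w - 1, x + radius + 1)):
            dx = gaussian_img[j, i + 1] - gaussian_img[j, i - 1]
            dy = gaussian_img[j + 1, i] - gaussian_img[j - 1, i]
            magnitude = np.hypot(dx, dy)
            theta = np.degrees(np.arctan2(dy, dx)) % 360
            # Weight by a Gaussian window centered on the keypoint.
            weight = np.exp(-((i - x) ** 2 + (j - y) ** 2) / (2 * (0.5 * radius) ** 2))
            hist[int(theta // (360 / num_bins)) % num_bins] += weight * magnitude
    return np.argmax(hist) * (360 / num_bins)  # dominant orientation in degrees
```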
Keypoint descriptor
Input: the refined location of a keypoint
Output: a 4×4×8 = 128-element feature vector
The descriptor is built from the image gradients in a 16×16 region around the keypoint, rotated to its assigned orientation: the region is divided into a 4×4 grid of cells, and each cell accumulates an 8-bin gradient orientation histogram, giving 4 × 4 × 8 = 128 values. The vector is then normalized to reduce the effect of illumination changes.
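To make the 128 dimensions concrete, here is a heavily simplified sketch of the histogram layout. It skips the rotation to the assigned orientation, the Gaussian weighting, and the trilinear interpolation used by the real descriptor, and it assumes the keypoint sits away from the image border:

```python
import numpy as np

def descriptor(gaussian_img, x, y, num_cells=4, cell_size=4, num_bins=8):
    """Simplified 4x4x8 descriptor: per-cell gradient orientation histograms."""
    half = num_cells * cell_size // 2                  # 16x16 region -> half = 8
    hist = np.zeros((num_cells, num_cells, num_bins))
    for j in range(-half, half):
        for i in range(-half, half):
            yy, xx = y + j, x + i
            dx = gaussian_img[yy, xx + 1] - gaussian_img[yy, xx - 1]
            dy = gaussian_img[yy + 1, xx] - gaussian_img[yy - 1, xx]
            magnitude = np.hypot(dx, dy)
            theta = np.degrees(np.arctan2(dy, dx)) % 360
            cell_r, cell_c = (j + half) // cell_size, (i + half) // cell_size
            hist[cell_r, cell_c, int(theta // (360 / num_bins)) % num_bins] += magnitude
    vec = hist.flatten()                               # 4 * 4 * 8 = 128 values
    vec /= (np.linalg.norm(vec) + 1e-7)                # normalize against illumination
    vec = np.minimum(vec, 0.2)                         # clamp large gradients (Lowe's 0.2 cap)
    return vec / (np.linalg.norm(vec) + 1e-7)          # renormalize
```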
In this part, I used SIFT functions from both siftDemoV4 and VLFeat to generate the keypoints and descriptors, then drew them over the original image. siftDemoV4 provides us with the functions `sift` and `showkeys`.

The result generated by VLFeat is relatively cleaner. It can randomly select n keypoints for descriptor visualization; in this case n = 50:
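For readers without MATLAB, here is a roughly equivalent pipeline in Python with OpenCV. This is not what I used for the assignment, and `SIFT_create` requires opencv-python 4.4 or later:

```python
import cv2
import numpy as np

# Detect keypoints and compute 128-element descriptors.
image = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
print(descriptors.shape)  # (num_keypoints, 128)

# Draw a random subset of 50 keypoints, similar to the VLFeat visualization.
subset = np.random.choice(len(keypoints), size=min(50, len(keypoints)), replace=False)
vis = cv2.drawKeypoints(image, [keypoints[i] for i in subset], None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("keypoints.jpg", vis)
```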

What's next
In the next post, two more techniques will join the workflow of SIFT to achieve image stitching across multiple images. Go check that out!