Local features

Faisal Qureshi
Faculty of Science
Ontario Tech University
Oshawa ON Canada

© Faisal Qureshi


Creative Commons Licence
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Lesson Plan

  • Characteristics of a good local feature
  • Raw patches as local features
  • SIFT descriptor
  • Feature detection and matching in OpenCV
  • Blob detection
  • MSER in OpenCV
  • Applications of local features


Consider image stitching. It requires that we find corresponding "locations" in two images. Given these corresponding locations, we can compute homography, which would allow us to stitch the two images to construct a panorama.

Review: Characteristics of a Good Feature

  • Repeatability — feature is invariant to geometric, lighting, etc. changes
  • Saliency or distinctiveness
  • Compactness — efficiency, many fewer features than the number of pixels in the image
  • Locality — robustness to clutter and occlusion, a feature should only occupy a small area of an image


We need to find at least some of the same points in two images to any chance of finding true matches. There is little chance that we can find corresponding locations given the following two images.

Detection process run independently on two images should return at least some of the corresponding locations as seen below.

Recall that we have attempted to address this issue by interest point detection. These are locations in the image that are (somewhat) "invariant" to geometric and photometric changes. Specifically, we identified corner locations as those that are covariant to translation and rotation and partially invariant to changes in intensity. Recall also that corner detection is not invariant to changes in scale.

Observation 1: identify interest points locations (say, through corner detection) and construct local features around these locations.

Interest point detectors

Available interest point detectors

  • Hessian & Harris [Beaudet 78], [Harris 88]
  • Laplacian, Difference of Gaussian (DoG) [Lindeberg 98], [Lowe 99]
  • Harris-/Hessian-Laplace [Mikolajczyk & Schmid 01]
  • Harris-/Hessian-Affine [Mikolajczyk & Schmid 04]
  • Edge-based Region Detector (EBR) and (Intensity Extrema Based Region Detector) IBR [Tuytelaars & Van Gool 04]
  • Maximally Stable Extremal Regions (MSER) [Matas 02]
  • Salient Regions [Kadir & Brady 01]
  • and many others

Which interst point detector should you choose?

What do you want it for?

  • Precise localization in x-y: Harris
  • Good localization in scale: DoG
  • Flexible region shape: MSER

Best choice often application dependent

  • Harris-/Hessian-Laplace/DoG work well for many natural categories
  • MSER works well for buildings and printed things

Take home lesson

  • There have been extensive evaluations/comparisons [Mikolajczyk et al., IJCV 05, PAMI 05]. Best to check these out and select the best interest point detector for your application.
  • It is sometimes useful to use multiple detectors simultaneously to help with matching over a range of image categories

We will soon see that deep learning has revolutionized image feature construction. More on this later.


We want to reliably determine which location in one image goes with which location in the second image. The computed features should be invariant to geometric and photometric differences between the two images. Consider the following figure

Our task is to find the corresponding locations in the two images. This means that we need to figure out which of the two locations in the image on the right matches with the location shown in the image on the left.

Observation 2: compute descriptors that encode the area surrounding an interest point. These descriptors should be compact (for computational reasons), these should have local support, and these should have invariance properties with respect to geometric and photometric changes.

? Why do we want to encode local region around an interest point. Why not encode the entire image?

Local feature descriptors

Encode area around interest points as vectors. We can then easily match these features to identify the corresponding locations between the two images. The following figure illustrates this idea. Here, local area around three interest point locations (one in the left image, and two in the right image) is encoded as $d$-dimensional vectors.

We can find the corresponding location by matching these $d$-dimensional vectors. There are many options for doing so. E.g., we can uses sum-of-squared differences (SSD) to match these vectors. Alternately, we can use cosine similarity. And there are many other techniques for matching vectors.

Computing local feature descriptors

Invariance to translation, rotation, and scale.

Invariance to changes in intensity and color.

(Figures courtesy T. Tuytelaars ECCV 2006 tutorial)

Raw patches as local descriptors

The simplest way to describe the neighborhood around an interest point is to write down the list of intensities to form a feature vector.
Consider the figure below.

The image patch around the interest point locations (depicted by the red circles) are as seen below.

Lets write down the list of intensities in these patches to form the feature vector

In [1]:
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt
In [2]:
left_patch = cv.imread('data/local-features-construction-2.jpg')
left_patch = cv.cvtColor(left_patch, cv.COLOR_BGR2RGB)
left_patch = cv.resize(left_patch, (32, 32), interpolation=cv.INTER_NEAREST)

right_patch = cv.imread('data/local-features-construction-1.jpg')
right_patch = cv.cvtColor(right_patch, cv.COLOR_BGR2RGB)
right_patch = cv.resize(right_patch, (32, 32), interpolation=cv.INTER_NEAREST)

<matplotlib.image.AxesImage at 0x12b45efd0>

Notice that listing intensities is very sensitiive to changes in rotation, scale, intensity, etc. These make poor feature descriptors.

Shift Invariant Feature Transform (SIFT) [Lowe 2004]

Description taken from various places, including https://sbme-tutorials.github.io/2019/cv/notes/7_week7.html

SIFT Pyramid

Construct SIFT pyramid, which consists of Octaves and Scales. Octaves are different levels of image resolutions (pyramids levels), and scales represent different scales of window in each octave level (different $\sigma$ of Gaussian window)

Key-point localization in scale

At each scale compare cornerness with neighbouring scales (upper and lower scales) and pick the scale with maximum cornerness value. Not all corners in an image are localized at the same scale.

Computing SIFT descriiptor

  • Use image gradients instead of raw intensities
  • Use histograms to bin pixels (gradients) within sub-patches according to their orientation.
  • Use of image gradients provides partial invariance to changes in illumination
  • Using subpatches maintains spatial structure.

Achieving rotation invariance

Rotate patch according to its dominant gradient orientation. This puts the patches into a canonical orientation.

(Image from Matthew Brown.)

See below for how to find the dominant gradient orientation.

Computing SIFT descriiptor

  • Use image gradients instead of raw intensities
  • Use histograms to bin pixels (gradients) within sub-patches according to their orientation.
  • Use of image gradients provides partial invariance to changes in illumination
  • Using subpatches maintains spatial structure.

128-dimensional SIFT Descriptors

After localization of a key-point in our scale space. We can get its SIFT descriptor as follow

  • Extract a $16 \times 16$ window centered by this point.
  • Get gradient magnitude and multiply it by a $16 \times 16$ Gaussian window of $\sigma=1.5$
  • Get gradient angle direction.
  • Adjusting orientation (To be rotation invariant): get the gradient angle of the window and Quantize them to 36 values $(0, 10, 20, \cdots, 360)$
  • Locate dominant corner direction which is most probable angle (angle with max value in 36 bit angle histogram) subtract dominant direction from gradient angle.
  • For each block get magnitude weighted angle histogram and normalize it (divide by total gradient magnitudes). Here angles are quantized to 8 angles [0, 45, 90, … , 360] based on its relevant gradient magnitude i.e (histogram of angle 0 = sum(all magnitudes with angle 0)).

SIFT properties

  • Extraordinarily robust matching technique
  • Can handle changes in viewpoint of up to about 60 degree out of plane rotation
  • Can handle significant changes in illumination

(Image from Steve Seitz)
  • Fast and efficient—can run in real time

NASA Mars Rover images with SIFT feature matches. (Figure by Noah Snavely.)

SIFT summary

  • Invariant to scale and rotation
  • Partially invariant to illumination changes, camera viewpoint, occlusion, and clutter


SIFT was initially included in OpenCV; however, it is no longer available. Since SIFT was patented. An option is to use VLfeat library, which includes a SIFT implementation. VLfeat currently doesn't have a "stable" Python biding. Still you are welcome to try it using pip install pyvlfeat.

Matching local features

Consider the figures below. "SIFT patches" are overlaid on the two images. Our goal is to generate candidate matches.

Classical feature descriptors, such as SIFT, SURF, etc., are usually compared and matched using the Euclidean distance (or L2-norm). Other techniques for matching these features are Cosine similarity, Earth Mover's Distance (also known as Wasserstein Distance), etc.

Distance computation in Python

Check out scipy.spatial.distance module for various methods for computing distance matrix for a collection of raw observation vectors stored in a rectangular array.

See here for more information.

Exercise 1

Compute Cosine and Euclidean distance matrix between three vectors $[1,0,0]$, $[0,1,0]$, $[1,1,0]$, and $[10,-2,1]$

In [5]:
# %load solutions/local-features/solution-01.py

From distance to similarity

How do we convert distance values to similarity values. For cosine distance, simply subtract cosine distance from 1.0. In general if your distance metric returns values between 0 and 1, then you can use this trick.

Exercise 2

Compute Cosine similarity matrix between $[1,0,0]$, $[0,1,0]$, $[1,1,0]$, and $[10,-2,1]$

In [8]:
# %load solutions/local-features/solution-02.py

Gaussian kernel to convert distance to similarity

For other distances, we can use, say, a Gaussian kernel as follows:

$$ K(d) = \exp \left( \frac{d^2} {2 \sigma^2} \right), $$

where $d$ is the distance between two vectors $\mathbf{x}_1$ and $\mathbf{x}_2$. $\sigma$ is a tuning (or scaling) parameter. If $\sigma$ is high, $K(d)$ will be close to $1$ (i.e., high similarity) for large values of $d$. If $\sigma$ is small, even a small $d$ will reduce the similarity scores for the two vectors.

Exercise 3

Compute similarity matrix between $[1,0,0]$, $[0,1,0]$, $[1,1,0]$, and $[10,-2,1]$. Assume Euclidean distance metric.

In [11]:
# %load solutions/local-features/solution-03.py

Wasserstein distance

Wasserstien distance is computed between two probability distributions (below represented as histograms). Check out scipy.stats module for methods for computing Wasserstien distance.

In [12]:
from scipy.stats import wasserstein_distance
wasserstein_distance([0, 1, 3], [5, 6, 8])

Hamming distance

We now also have binary feature descritors, such as ORB, BRISK, which are matched using Hamming distance.

$$ d_{\mathrm{hamming}} (\mathbf{a}, \mathbf{b}) = \sum_{i=0}^{n-1} (a_i \oplus b_i) $$


  • SIFT descriptors represent the histogram of oriented gradient in a neighbourhood
  • SURF descriptors represent the histogram of of the Haar wavelet response in a neighborhood
  • See here) for more information about Binary Robust Independent Elementary Features (BRIEF)
  • Oriented Fast and Rotated Brief (ORB) [Ethan Rublee et al. 2011]

Bruteforce matching

Compare them all, take the closest (or closest $k$, or within a thresholded distance).

In [13]:
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

img1 = cv.imread('data/box.png',cv.IMREAD_GRAYSCALE)          
img2 = cv.imread('data/box_in_scene.png',cv.IMREAD_GRAYSCALE) 
(223, 324)
In [14]:
orb = cv.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None) # locations and descriptor
kp2, des2 = orb.detectAndCompute(img2, None)
In [15]:
bf = cv.BFMatcher(cv.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)
matches = sorted(matches, key = lambda x:x.distance)
In [16]:
img3 = cv.drawMatches(img1, kp1, img2, kp2, matches[:10], None, flags=cv.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)


KDTree data structure

See here for more information.

In [20]:
# %load solutions/local-features/kdtree.py

Ambiquous matches

Lets consider SSD metric for finding matches. How do we threshold on SSD? One approach is to compute the ratio of the distance to best match to distance to the second best match. If this ratio is low, the best match is a good candidate. If this ratio is high, then the best match could be an ambiguous match.

(Figure from Lowe 2004)

FLANN matching

Check this for more information.

(From OpenCV documentation) FLANN stands for Fast Library for Approximate Nearest Neighbors. It contains a collection of algorithms, such as KDTree, Locality Sensitive Hashing, etc., optimized for fast nearest neighbor search in large datasets and for high dimensional features. It works more faster than BFMatcher for large datasets.

For OpenCV implementation, possible values are:

In [21]:
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

img1 = cv.imread('data/box.png',cv.IMREAD_GRAYSCALE)          
img2 = cv.imread('data/box_in_scene.png',cv.IMREAD_GRAYSCALE) 
In [22]:
orb = cv.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None) # locations and descriptor
kp2, des2 = orb.detectAndCompute(img2, None)
In [23]:
index_params = dict(algorithm = FLANN_INDEX_LSH, table_number = 6)
search_params = dict(checks=50)
flann = cv.FlannBasedMatcher(index_params, search_params)
In [24]:
matches = flann.knnMatch(des1, des2, k=2)
In [25]:
good = []
for match in matches:
    if len(match) < 2: continue
    m, n = match[0], match[1]
    if m.distance < 0.75*n.distance:
In [26]:
img3 = cv.drawMatchesKnn(img1, kp1, img2, kp2, good, None, flags=cv.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)

Blob detection

Edge detection review

(Figure from Steve Seitz).

Second derivative of Gaussian (Laplacian)

(Figure from Steve Seitz).

From edges to blobs

(Figure from Lana Lazebnik).
  • Edge = ripple
  • Blob = superposition of two ripples
  • Spatial selection: the magnitude of the Laplacian response will achieve a maximum at the center of the blob, provided the scale of the Laplacian is "matched" to the scale of the blob

Blob detection in 2D

Laplacian of Gaussian is circularly symmetric operator for blob detection in 2D.

$$ \nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} $$

(Figure from Lana Lazebnik).

Characteristic scale

We define the characteristic scale as the scale that produces peak of Laplacian response.

(Figure from Lana Lazebnik).

Difference of Gaussian

We can approximate Laplacian as Difference of Gaussian (DOG), which much more efficient to compute.

Blob detection example in OpenCV

The relevant parameters are described below:

  • Area: filter the blobs based on size
  • Circularity: a measure of how close the blob is to a circle. Circularity is defined by $\frac{4 \pi\ \mathrm{area}}{\mathrm{parameter}}$.
  • Convexity: the ratio of area of the blob and the are of its convex hull. Convexity values lie between $0$ and $1$, inclusive.
  • Inertia: the measure of "ellipseness" of a shape. A circle has inertia of $1$ and a line has inertia of $0$. The inertia of an ellipse lies somewhere between $0$ and $1$.
In [27]:
import cv2 as cv
import numpy as np
import matplotlib.pyplot as plt

filename = "data/butterfly.jpg"
#filename = "data/BlobTest.jpg"
im = cv.imread(filename)
im = cv.cvtColor(im, cv.COLOR_BGR2RGB)
In [28]:
params = cv.SimpleBlobDetector_Params()
params.minThreshold = 10 # Change thresholds
params.maxThreshold = 250
params.filterByArea = False # Filter by Area.
params.minArea = 100
params.filterByCircularity = False # Filter by Circularity
params.minCircularity = 0.1
params.filterByConvexity = False # Filter by Convexity
params.minConvexity = 0.9
params.filterByInertia = False # Filter by Inertia
params.minInertiaRatio = 0.9
detector = cv.SimpleBlobDetector_create(params)
In [29]:
keypoints = detector.detect(im)
In [30]:
im_with_keypoints = cv.drawKeypoints(im, keypoints, np.array([]), (255,0,0), cv.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)

<matplotlib.image.AxesImage at 0x13f1a0d50>