Faisal Qureshi
Professor
Faculty of Science
Ontario Tech University
Oshawa ON Canada
http://vclab.science.ontariotechu.ca
© Faisal Qureshi
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Structure and depth are inherently ambiguous from single views.
Figures from Lana Lazebnik.
Notice that 3D points $a$, $b$, $c$ all project to the same location $a'=b'=c'$ in the image. This suggests that there is no straightforward way to estimate depth from a single image.
Human eyes fixate on points in space: they rotate so that the corresponding images form at the centers of the foveae.
From Bruce and Green, Visual Perception, Physiology, Psychology and Ecology
Disparity occurs when eyes fixate on one object; others appear at different visual angles
From Bruce and Green, Visual Perception, Physiology, Psychology and Ecology
Specifically, disparity $d$ is given by the following relation
$$ d = r - l = D - F $$
Forsyth and Ponce
Take two pictures of the same subject from two slightly different viewpoints and display so that each eye sees only one of the images.
Invented by Sir Charles Wheatstone, 1838.
Image from fisher-price.com
http://www.johnsonshawmuseum.org
http://www.johnsonshawmuseum.org
http://www.johnsonshawmuseum.org
Stereo glasses needed to watch 3D movies.
http://www.johnsonshawmuseum.org
Brain fuses information from left/right images when these are shown in quick succession to give an appearance of depth.
http://www.well.com/~jimg/stereo/stereo_list.html
http://www.well.com/~jimg/stereo/stereo_list.html
Try to see beyond the image, merging the two dots in the process. You should perceive a 3D structure. See if you can see it.
magiceye.com
If it works out, you will see the following 3D structure.
magiceye.com
Find the same point in two images (local feature descriptors?)
We’ll assume for now that these parameters are given and fixed.
Here extrinsic parameters describe the relationship between two cameras or a single moving camera.
Assuming parallel optical axes and known camera parameters (i.e., calibrated cameras), we get the following setup:
Consider $\triangle (p_l, \mathbf{p}, p_r)$ shown in red and $\triangle (O_l, \mathbf{p}, O_r)$ shown in blue in the following figure
$\triangle (p_l, \mathbf{p}, p_r) \sim \triangle (O_l,\mathbf{p},O_r)$
Then
$$ \begin{align} \frac{T + \mathbf{x}_l - \mathbf{x}_r}{Z - f} &= \frac{T}{Z} \\ \Rightarrow Z &= f \frac{T}{\mathbf{x}_r - \mathbf{x}_l} \end{align} $$
This suggests that we can estimate the depth of point $\mathbf{p}$ by using disparity $(\mathbf{x}_r - \mathbf{x}_l)$.
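As a quick numerical illustration of the relation $Z = f\,T / (\mathbf{x}_r - \mathbf{x}_l)$, the sketch below computes depth from a pair of image coordinates. The function name and the values of $f$, $T$, and the coordinates are made up for illustration; it assumes rectified cameras with parallel optical axes and uses the sign convention of the equation above.

```python
import numpy as np

def depth_from_disparity(x_l, x_r, f, T):
    """Depth Z = f * T / (x_r - x_l) for a parallel-axis stereo pair.

    x_l, x_r: image x-coordinates of the same 3D point (scalars or arrays).
    f: focal length, T: baseline. Z is returned in the units of T when
    f and the coordinates share the same units (e.g., pixels).
    """
    d = np.asarray(x_r, dtype=float) - np.asarray(x_l, dtype=float)
    # Zero disparity corresponds to a point infinitely far away.
    with np.errstate(divide="ignore"):
        return np.where(d != 0, f * T / d, np.inf)

# Hypothetical values: f = 700 pixels, T = 0.1 m.
Z_near = depth_from_disparity(100.0, 120.0, f=700.0, T=0.1)  # disparity 20
Z_far = depth_from_disparity(100.0, 110.0, f=700.0, T=0.1)   # disparity 10
print(Z_near, Z_far)
```

Halving the disparity doubles the estimated depth, which is why distant points are harder to resolve: a small error in disparity produces a large error in $Z$.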
If we can find corresponding points (locations) in two images (top and middle), we can compute disparity for these locations (last row). We can then use disparity to calculate the relative depth.
Then
$$ (x',y') = (x + d(x,y), y) $$
Aside: the red dot without arrows in the top figure is $(x',y')$ and the red dot without arrows in the bottom figure is $(x,y)$. This confirms that the same 3D point appears at different locations in the two images.
Triangulate from corresponding image points in two or more images.
Given an image point $p$ in the left image, its corresponding point in the right image must lie on the line $\overline{e'p'}$. Note that this line is the intersection of two planes: the right image plane and the plane $oo'p$. The plane $oo'p$ contains the point $p$ in the left image and the two optical centers.
Figure from Hartley and Zisserman
Figure from Hartley and Zisserman
Corresponding epipolar lines drawn in the left and the right images.
Figure from Hartley and Zisserman
Figure from Hartley and Zisserman
Figure from Hartley and Zisserman
Figure from Hartley and Zisserman
Using the epipolar constraint reduces stereo matching to a one-dimensional search problem.
Figure from Hartley and Zisserman
The fundamental matrix is the algebraic representation of epipolar geometry.
Consider the following setup. For a point $\mathbf{x}$ in image $I$ there exists an epipolar line $\mathbf{l}'$ in image $I'$. Any point $\mathbf{x}'$ in image $I'$ matching point $\mathbf{x}$ must sit on this line $\mathbf{l}'$.
Let us consider a plane $\Pi$ that doesn't pass through either camera center. A ray through the first camera center passing through point $\mathbf{x}$ meets plane $\Pi$ at point $\mathbf{X}$. Point $\mathbf{X}$ is then projected onto image 2 at point $\mathbf{x}'$. $\mathbf{x}'$ has to lie on epipolar line $\mathbf{l}'$. Points $\mathbf{x}$ and $\mathbf{x}'$ are projectively equivalent to $\mathbf{X}$, because both are images of the 3D point $\mathbf{X}$ lying on a plane. This means that points $\mathbf{x}$ and $\mathbf{x}'$ are related via a homography $H_{\Pi}$. We assume that $\mathbf{X}$ lies on a plane simply for mathematical convenience. The above discussion holds in general.
Thus
$$ \mathbf{x}' = H_{\Pi} \mathbf{x} $$
The epipolar line corresponding to point $\mathbf{x}$ passes through epipole $\mathbf{e}'$ and point $\mathbf{x'}$. Therefore,
$$ \begin{align} \mathbf{l}' &= \mathbf{e}' \times \mathbf{x}' \\ &=[\mathbf{e}']_{\times} \mathbf{x}' \\ &= [\mathbf{e}']_{\times} H_{\Pi} \mathbf{x} \\ &= F \mathbf{x} \end{align} $$
Here $F = [\mathbf{e}']_{\times} H_{\Pi}$ is the fundamental matrix. $[\mathbf{e}']_{\times}$ has rank 2 and $H_{\Pi}$ has rank 3; therefore, $F$ has rank 2. $F$ represents a mapping from a 2-dimensional projective space (points in the first image) onto a 1-dimensional projective space (the pencil of epipolar lines through $\mathbf{e}'$).
Consider a camera matrix $P$. The ray backprojected from $\mathbf{x}$ by $P$ is obtained by solving $P \mathbf{X} = \mathbf{x}$:
$$ \mathbf{X}(\lambda) = P^+ \mathbf{x} + \lambda \mathbf{C} $$
Here $P^+$ is the pseudo-inverse of $P$, such that $PP^+ = \mathbf{I}$. $\mathbf{C}$ is the camera center, obtained from $P \mathbf{C} = \mathbf{0}$ (i.e., $\mathbf{C}$ is the null vector of $P$).
Two points on the ray above are $P^+ \mathbf{x}$ (when $\lambda=0$) and $\mathbf{C}$ (when $\lambda=\infty$). These two points are imaged by the second camera $P'$ at $P'P^+\mathbf{x}$ and $P'\mathbf{C}$. Note that $P'\mathbf{C}$ is the epipole $\mathbf{e}'$.
The epipolar line is
$$ \begin{align} \mathbf{l}' &= (P' \mathbf{C}) \times (P' P^+ \mathbf{x}) \\ &= [\mathbf{e}']_{\times} P'P^+ \mathbf{x} \\ & = F \mathbf{x} \end{align} $$
This is essentially the same formulation. Here $H_{\Pi} = P'P^+$.
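The construction $F = [\mathbf{e}']_{\times} P'P^+$ can be checked numerically. The sketch below is only an illustration: the camera matrices (intrinsics `K`, rotation `R`, translation `t`) and the 3D point are hypothetical values, and the helper names are not from the notes. It builds $F$ from two camera matrices, projects one 3D point into both images, and confirms that the second image point lies on the line $\mathbf{l}' = F\mathbf{x}$.

```python
import numpy as np

def skew(a):
    """Cross-product matrix [a]_x, so that skew(a) @ b equals np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def fundamental_from_cameras(P, P_prime):
    """F = [e']_x P' P^+ for 3x4 camera matrices P and P'."""
    # The camera center C is the null vector of P: the right singular
    # vector associated with the zero singular value.
    _, _, Vt = np.linalg.svd(P)
    C = Vt[-1]
    e_prime = P_prime @ C  # epipole in the second image
    return skew(e_prime) @ P_prime @ np.linalg.pinv(P)

# Hypothetical cameras: shared intrinsics, second camera rotated and shifted.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.05
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([[-0.1], [0.0], [0.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_prime = K @ np.hstack([R, t])

F = fundamental_from_cameras(P, P_prime)
F /= np.linalg.norm(F)  # F is only defined up to scale

X = np.array([0.3, -0.2, 4.0, 1.0])  # a 3D point in homogeneous coordinates
x = P @ X
x /= x[2]
x_prime = P_prime @ X
x_prime /= x_prime[2]

l_prime = F @ x                   # epipolar line of x in the second image
print(abs(x_prime @ l_prime))     # close to zero: x' lies on l'
```

The camera center is taken from the SVD rather than solved for in closed form, which keeps the sketch valid for any full-rank $3 \times 4$ camera matrix.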
Note that point $\mathbf{x}'$ lies on epipolar line $\mathbf{l}'$. Therefore, the following must be true:
$$ \begin{align} &\ {\mathbf{x}'}^T \mathbf{l}' &= 0 \\ \Rightarrow &\ {\mathbf{x}'}^T F \mathbf{x} &= 0 \end{align} $$
The fundamental matrix $F$ is a 3-by-3 matrix:
$$ F = \left[ \begin{array}{ccc} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \\ \end{array} \right]. $$
It can be determined up to an arbitrary scale factor, i.e., this matrix has 8 unknowns. Consequently, we need eight equations to estimate this matrix. Given points $\mathbf{x}=(x,y,1)$ and $\mathbf{x}'=(x',y',1)$, we know that
$$ {\mathbf{x}'}^T F \mathbf{x} = 0. $$
We can rewrite the above equation as follows:
$$ \left[ \begin{array}{ccccccccc} xx' & yx' & x' & xy' & yy' & y' & x & y & 1 \end{array} \right] \left[ \begin{array}{c} f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \\ f_{33} \end{array} \right] = 0 $$
If we have eight points, we can stack 8 such equations and set up the following system of linear equations
$$ \mathbf{A} \mathbf{f} = 0. $$
We can solve it by applying Singular Value Decomposition to $\mathbf{A}$. SVD of $\mathbf{A}$ yields $\mathbf{U} \mathbf{S} \mathbf{V}^T$. The solution is the last column of $\mathbf{V}$, i.e., the right singular vector associated with the smallest singular value.
Recall that the fundamental matrix is a rank 2 matrix. The above estimation procedure doesn't account for that. For a more accurate solution, reshape $\mathbf{f}$ into a 3-by-3 matrix and take $\mathbf{F}$ to be its closest rank 2 approximation (for example, by setting its smallest singular value to zero).
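The estimation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `eight_point` and the synthetic cameras and points are assumptions made here, the data are noise-free, and the coordinate normalization usually applied before solving (Hartley's normalized eight-point algorithm) is omitted for brevity.

```python
import numpy as np

def eight_point(x, x_prime):
    """Estimate F from n >= 8 correspondences given as (n, 2) arrays."""
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    u, v = x[:, 0], x[:, 1]
    up, vp = x_prime[:, 0], x_prime[:, 1]
    # One row [xx', yx', x', xy', yy', y', x, y, 1] per correspondence.
    A = np.column_stack([u * up, v * up, up,
                         u * vp, v * vp, vp,
                         u, v, np.ones_like(u)])
    # f is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value of F.
    U, S, Vt_F = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt_F

# Synthetic check: 20 random 3D points seen by two hypothetical cameras
# P = [I | 0] and P' = [R | t].
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 20),
                     rng.uniform(-1, 1, 20),
                     rng.uniform(2, 4, 20)])
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-1.0, 0.2, 0.1])
x = X[:, :2] / X[:, 2:3]
X2 = X @ R.T + t
x_prime = X2[:, :2] / X2[:, 2:3]

F = eight_point(x, x_prime)
xh = np.column_stack([x, np.ones(len(x))])
xph = np.column_stack([x_prime, np.ones(len(x_prime))])
residuals = np.abs(np.sum(xph * (xh @ F.T), axis=1))  # |x'^T F x| per pair
print(residuals.max())
```

Using more than eight correspondences turns the problem into a least-squares fit, and the same SVD step still applies. With noisy pixel coordinates, normalizing the points first generally improves conditioning.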
Epipolar geometry constrains our search, but the stereo matching problem isn't fully solved yet. Consider the following rectified image pair. Our goal is to find the pixel/region in the right image that matches the pixel/region highlighted in the left image.
It is often better to match regions, rather than pixel intensities.
The size of the window used for matching affects the overall performance. Smaller windows create more detailed depth maps, but also lead to higher noise. Larger windows result in smoother disparity maps, but these lack fine details.
Matching continues to be a difficult problem, especially when dealing with texture-less surfaces, occlusions, repetitions, non-Lambertian surfaces, and surfaces exhibiting specularities or transparency. Often the following three priors are employed to improve matching.
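A minimal sketch of window-based matching on a rectified pair is given below. It compares a square window around a pixel in the left image against windows along the same row of the right image, using the sum of squared differences (SSD) as the cost. The function name, the SSD cost, the search range, and the synthetic test images are choices made here for illustration; the disparity sign follows $(x',y') = (x + d, y)$.

```python
import numpy as np

def ssd_match(left, right, row, col, half=3, d_range=(-16, 16)):
    """Find the disparity d minimizing the SSD between windows.

    The window of size (2*half + 1) centered at (row, col) in `left` is
    compared with windows centered at (row, col + d) in `right`.
    Assumes the left window lies fully inside the image.
    """
    w = left.shape[1]
    patch = left[row - half:row + half + 1,
                 col - half:col + half + 1].astype(float)
    best_d, best_cost = None, np.inf
    for d in range(d_range[0], d_range[1] + 1):
        c = col + d
        if c - half < 0 or c + half >= w:
            continue  # candidate window would fall outside the image
        cand = right[row - half:row + half + 1,
                     c - half:c + half + 1].astype(float)
        cost = np.sum((patch - cand) ** 2)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost

# Synthetic pair: the right image is the left image shifted 5 columns.
rng = np.random.default_rng(0)
left = rng.random((40, 60))
right = np.roll(left, 5, axis=1)

d, cost = ssd_match(left, right, row=20, col=30)
print(d, cost)
```

Increasing `half` averages the cost over a larger region, which is the smoothing effect described above; a smaller `half` keeps detail but makes the minimum less reliable on noisy or weakly textured images.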
From Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming, Zhang, Curless, and Seitz
From https://bbzippo.wordpress.com/2010/11/28/kinect-in-infrared/
Given
$$ \mathbf{a}=[a_1, a_2, a_3] $$
and
$$ \mathbf{b}=[b_1, b_2, b_3] $$
Define
$$ [\mathbf{a}]_{\times}= \left[ \begin{array}{ccc} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \\ \end{array}\right] $$
Then
$$ \mathbf{a} \times \mathbf{b} = [\mathbf{a}]_{\times} \mathbf{b} $$
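The identity can be verified numerically. The short sketch below uses an illustrative helper name (`cross_matrix`) and arbitrary example vectors. It builds $[\mathbf{a}]_{\times}$ and compares the matrix-vector product against NumPy's cross product.

```python
import numpy as np

def cross_matrix(a):
    """Return the 3x3 skew-symmetric matrix [a]_x."""
    a1, a2, a3 = a
    return np.array([[0.0, -a3, a2],
                     [a3, 0.0, -a1],
                     [-a2, a1, 0.0]])

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

lhs = np.cross(a, b)
rhs = cross_matrix(a) @ b
print(lhs, rhs)                 # both equal [-3, 6, -3]
print(cross_matrix(a) @ a)      # a x a = 0, so a is a null vector of [a]_x
```

The second print shows why $[\mathbf{a}]_{\times}$ has rank 2 for any nonzero $\mathbf{a}$: the vector $\mathbf{a}$ itself always lies in its null space.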