Faisal Qureshi

Professor

Faculty of Science

Ontario Tech University

Oshawa ON Canada

http://vclab.science.ontariotechu.ca

© Faisal Qureshi

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

In [4]:

```
import numpy as np
import math
from scipy import linalg
```

- Pinhole camera model
- Homoegenous coordinates
- Intrinsic and extrinsic camera matrices
- Lens effects
- Camera calibration

Known during the Classical Period (e.g., China c. 470BC)

Outside scene imaged on a wall in Prague Castle. This probably occured due to a small crack on the opposite wall.

Given $Y$, the height of the object, $Z$, its distance from the pinhole, and $f$, the distance between the pinhole and the image plane (also known as the *focal length*), we want to estimate $y$, the height of the image of this object.

Using the *similar triangles* argument, we get

Re-arranging the above expression yields

$$ y = f Y / Z. $$Pinhole model describes the relationship between a 3d point $(X,Y,Z)$ and its imaging location $(x,y)$. Here $f$ is the distance of the image plane and the pinhole, also known as the focal length, and $Z$ is the distance of the point from the pinhole. This formulation assumes that pinhole is situated at location $(0,0,0)$, or the *origin*, and the camera's optical axis is aligned with the global z-axis.

Invariant to Scaling, i.e., $(x,y,w) = k(x,y,w)$. Both $(x,y,w)$ and $(kx, ky, kw)$ represent the same 2D Cartesian point $(x/w, y/w)$.

Consider two 2D Cartesian points $(x_1,y_1)$ and $(x_2,y_2)$. We can find the line that passes through these points using cross-product in homogeneous coordinates.

Recall that equation of a line is $ax + by+c = 0$ then $$ (a,b,c) = (x_1,y_1,1) \times (x_2,y_2,1) $$

If line $a x + b y + c = 0$ satisfies points $(x_1, y_1)$ and $(x_2, y_2)$ then the following must be true

$$ \begin{align} a x_1 + b y_1 + c &= 0 \\ a x_2 + b y_2 + c &= 0 \\ \end{align} $$We can re-write these equations in terms of the dot product of two vectors $\left[ \begin{array}{ccc} a & b & c \end{array} \right] \left[ \begin{array}{c} x_1 \\ y_1 \\ 1 \end{array} \right] = 0$ and $\left[ \begin{array}{ccc} a & b & c \end{array} \right] \left[ \begin{array}{c} x_2 \\ y_2 \\ 1 \end{array} \right] = 0$. The dot product of two vectors is only zero if the two vectors are perpendicular to each other. This suggests that vector $\left[ \begin{array}{ccc} a & b & c \end{array} \right]$ is perpendicular to both $\left[ \begin{array}{ccc} x_1 & y_1 & 1 \end{array} \right]$ and $\left[ \begin{array}{ccc} x_2 & y_2 & 1 \end{array} \right]$.

Given two vectors, their cross-product returns a vector that is perpendicular to both vectors. Consequently, we can compute $\left[ \begin{array}{ccc} a & b & c \end{array} \right]$ by taking a cross-product of vectors $\left[ \begin{array}{ccc} x_1 & y_1 & 1 \end{array} \right]$ and $\left[ \begin{array}{ccc} x_2 & y_2 & 1 \end{array} \right]$.

Cross-product of two vectors $(d,e,f)$ and $(r,s,t)$ is computed as seen below.

$$ (d,e,f) \times (r,s,t) = \mathrm{det} \left[\begin{array}{ccc} \hat{i} & \hat{j} & \hat{k}\\ d & e & f \\ r & s & t \end{array} \right] $$Find a representation of the line between points $\mathbf{a}=(1,2)$ and $\mathbf{b}=(4,3)$ using homogeneous coordinates.

In [5]:

```
# Your solution here
```

In [6]:

```
# %load solutions/image-formation/solution_01.py
```

Given two lines $a_1 x + b_1 y + c_1=0$ and $a_2 x + b_2 y + c_2=0$, we can compute their intersection using cross-product as follows:

**Step 1: Find intersection point in homogeneous coordinates**

**Step 2: Compute Cartesian coordinates from homogeneous coordinates**

Intersection point is then $(x/w,y/w)$

Along the similar lines as before.

Use homogeneous coordinates to determine the intersection location for the following two lines.

$$ \begin{align} 3x - y + 7 & = 0 \\ 2x - y + 1 & = 0 \end{align} $$In [7]:

```
l1 = np.array([3,-1,7])
l2 = np.array([2,-1,1])
ph = np.cross(l1, l2)
print('ph = {}'.format(ph))
p = ph[:2] / ph[2]
print('p = {}'.format(p))
print('Point p sits on l1: {}'.format(math.isclose(np.dot(l1, np.append(p,[1])), 0.0)))
print('Point p sits on l2: {}'.format(math.isclose(np.dot(l2, np.append(p,[1])), 0.0)))
```

Find intersection location for the following two lines.

$$ \begin{align} 3x - y + 7 & = 0 \\ 3x - y - 3 & = 0 \end{align} $$In [8]:

```
# Your solution here
```

In [9]:

```
# %load solutions/image-formation/solution_02.py
```

$w=0$ if two lines are parallel. We say that these lines intersect at infinity. This suggests that it is possible to represent points at infinity in homogeneous coordinates. Just set the homogeneous coordinate to $0$.

We can represent a point at infinity by setting its $w=0$. E.g., Homogeneous coordinate $(3,-1,3,0)$ represents a point at infinity in the direction $(3,-1,3,0)$.

We now develop a mathematical model for pinhole cameras. This discussion follows Ch. 6 of Multiple View Geometry in Computer Vision (2nd Ed.) by Harley and Zisserman. For the following discussion we assume the following camera geometry.

This assumes that the image plane sits in front of the pinhole. Obviously, no physical camera has this property. Never the less it is mathematically convenient (and equivalent) to model pinhole cameras as such.

- Camera centre sits at the origin of the Euclidean coordinate system Camera centre is also known as center of projection or optical centre
- Image plane (also known as the focal plane) sits at $Z = f$
- Point in 3D space $$ \mathbf{X} = (X, Y, Z)^T $$
- This point is mapped to $$ \left( f X / Z, f Y / Z, f \right) $$
- Image coordinates are (just drop $f$) $$ \left( f X / Z, f Y / Z \right) $$
- Principal axis or principal ray is the line parallel to the image plane and passing through the centre of projection
- Principal point is the intersection location of image plane with the principal axis
- Principal plane is parallel to the image plane and passes through the centre of projection

We can describe the pinhole camera model using homoegenous coordinates as follows

$$ \begin{align} \left( \begin{array}{c} X \\ Y \\ Z \\ 1 \end{array} \right) \longmapsto \left( \begin{array}{c} fX \\ fY \\ Z \end{array} \right) & = \left[ \begin{array}{cccc} f & & & 0 \\ & f & & 0 \\ & & 1 & 0 \end{array} \right] \left( \begin{array}{c} X \\ Y \\ Z \\ 1 \end{array} \right) \\ & = \mathrm{diag}(f,f,1)\left[\mathbf{I}\ |\ \mathbf{0}\right] \left( \begin{array}{c} X \\ Y \\ Z \\ 1 \end{array} \right) \end{align} $$We set

$$ \mathtt{P} = \mathrm{diag}(f,f,1)\left[\mathbf{I}\ |\ \mathbf{0}\right] $$to denote the projection matrix. Specifically, we use $\mathtt{P}$ to map 3D points $\mathbf{X}$ to image points $\mathbf{x}$:

$$ \mathbf{x} = \mathtt{P} \mathbf{X}. $$Under these assumptions, $\mathtt{P}$ has one degree-of-freedom, namely, the parameter $f$.

The above discussion assumes that the principle point sits at the origin of the coordinates in the image plane. Lets relax this assumption. Say $(p_x, p_y)$ are the coordinates of the principal point. Then

$$ \begin{align} \left( \begin{array}{c} X \\ Y \\ Z \\ 1 \end{array} \right) \longmapsto \left( \begin{array}{c} fX + p_x Z \\ fY + p_y Z \\ Z \end{array} \right) & = \left[ \begin{array}{cccc} f & & p_x & 0 \\ & f & p_y & 0 \\ & & 1 & 0 \end{array} \right] \left( \begin{array}{c} X \\ Y \\ Z \\ 1 \end{array} \right) \\ & = \mathbf{K} \left[\mathbf{I}\ |\ \mathbf{0}\right] \mathbf{X}_{\mathtt{cam}} \end{align} $$Note $\mathbf{X}_{\mathtt{cam}}$. This captures our assumption that the camera centre is located at the origin of the Euclidean coordinate system (we can think of this as the world coordinate system) with camera pointing straight down the $z$-axis. $\mathbf{X}_{\mathtt{cam}}$ denotes that the point is expressed in the camera coordinate system.

In this case, we set projection matrix to

$$ \mathtt{P} = \mathbf{K} \left[\mathbf{I}\ |\ \mathbf{0}\right] $$How many degrees of freedom does a projection matrix have when the camera centre is assumed to be fixed at the origin with the camera pointing straight down the $z$-axis? That is, how many parameters are required to specify a projection matrix $\mathtt{P}$ uniquely under these assumptions?

In [10]:

```
%%markdown
**Answer:** Your solution here:
```

In [11]:

```
# %load solutions/image-formation/solution_03.md
```

Now we relax our assumption that the camera centre is located at the origin of the Euclidean coordinate system (we can think of this as the world coordinate system) with camera pointing straight down the $z$-axis.

Say camera centre in inhomogeneous coordinates is $\tilde{\mathbf{C}}$ and that camera rotation is represented by matrix $\mathbf{R}$. Then we can transform an inhomogeneous point $\tilde{\mathbf{X}}$ in the world coordinate system to an inhomogeneous point $\tilde{\mathbf{X}}_{\mathtt{cam}}$ in the camera coordinates system as follows

$$ \tilde{\mathbf{X}}_{\mathtt{cam}} = \mathbf{R} \left( \tilde{\mathbf{X}} - \tilde{\mathbf{C}} \right) $$We can re-write this as follows

$$ \mathbf{X}_{\mathtt{cam}} = \left[ \begin{array}{cc} \mathbf{R} & - \mathbf{R} \tilde{C} \\ \mathbf{0} & 1 \end{array} \right] \left( \begin{array}{c} X \\ Y \\ Z \\ 1 \end{array} \right) $$Putting it all together

$$ \mathbf{x} = \mathbf{K} \mathbf{R} \left[\mathbf{I}\ |\ -\tilde{C}\right] \mathbf{X} $$It is often convenient to set $\mathbf{t} = -\mathbf{R} \tilde{C}$. Then we can rewrite the above as follows

$$ \mathbf{x} = \mathbf{K} \left[\mathbf{R}\ |\ \mathbf{t} \right] \mathbf{X} $$In this case, we set projection matrix to

$$ \mathtt{P} = \mathbf{K} \left[\mathbf{R}\ |\ \mathbf{t} \right] $$How many degrees of freedom does a projection matrix have when the camera centre is assumed to be fixed at $\tilde{\mathbf{C}}$ and the camera's rotation is represented by matrix $\mathbf{R}$? That is, how many parameters are required to specify a projection matrix $\mathtt{P}$ uniquely under these assumptions?

In [12]:

```
%%markdown
**Answer:** Your solution here:
```

In [13]:

```
# %load solutions/image-formation/solution_04.md
```

Thus far we have assumed that image coordinates have equal scales in $x$ and $y$. If the number of pixels per unit distance in image coordinates are $m_x$ and $m_y$ in the $x$ and $y$ directions then we re-write $\mathbf{K}$ as follows

$$ \mathbf{K} = \left[ \begin{array}{ccc} \alpha_x & s & x_0 \\ & \alpha_y & y_0 \\ & & 1 \end{array} \right] $$where $\alpha_x = f m_x$, $\alpha_y = f m_y$, $x_0 = m_x p_x$, $y_0 = m_y p_y$, $s$ is the skew term, which is often $0$. Note that $x_0$ and $y_0$ are the coordiantes of hte principal point in terms of pixel dimensions in $x$ and $y$ directions. $s$ becomes non-zero under special circumstances when lens is not aligned with the image plane.

How many degrees of freedom does a projection matrix have when the camera centre is assumed to be fixed at $\tilde{\mathbf{C}}$ and the camera's rotation is represented by matrix $\mathbf{R}$ when the pixels are non-uniformly scaled as just described? That is, how many parameters are required to specify a projection matrix $\mathtt{P}$ uniquely under these assumptions?

In [14]:

```
%%markdown
**Answer:** Your solution here:
```

In [15]:

```
# %load solutions/image-formation/solution_05.md
```

We get the following equation that describes the mapping from world points to image points.

$$ \begin{align} \mathbf{x} & = \mathtt{P} \mathbf{X} \\ & = \mathbf{K} \left[ \mathbf{R}\ |\ \mathbf{t} \right] \mathbf{X} \end{align} $$Note that $\mathtt{P}$ is the same as a homogeneous $3 \times 4$ matrix defined upto an arbitrary scale of rank 3. Projective matrices of rank less than 3 do not span the image plane. The projections might degenerate to a line or a point.

We refer to $\mathbf{K}$ as the intrinsic matrix, and $\left[ \mathbf{R}\ |\ \mathbf{t} \right]$ is the extrinsic matrix.

To make the following discussion easier, lets express $\mathtt{P}$ as

$$ \mathtt{P} = \left[ \mathtt{M}\ |\ \mathbf{p}_4 \right] $$Here $\mathtt{M}$ is the top-left $3 \times 3$ submatrix of $\mathtt{P}$ and $\mathbf{p}_4$ is the last column of $\mathtt{P}$. Furthermore, let $\mathbf{p}^{iT}$ and $\mathbf{p}_j$ represent the $i$-th row and $j$-th column of $\mathtt{P}$, respectively. We can similarly denote rows and columns of $\mathtt{M}$.

**Matrix** $\mathtt{M}$

Recall that $ \mathtt{P} = \left[ \begin{array}{cccc} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{array} \right]. $ Let $ \mathtt{M} = \left[ \begin{array}{ccc} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{array} \right] $ and $ \mathbf{p}_4 = \left[ \begin{array}{c} p_{14} \\ p_{24} \\ p_{34} \end{array} \right]. $ Then $\mathtt{P} = \left[ \mathtt{M}\ |\ \mathbf{p}_4 \right].$

If $\mathtt{M}$ is non-singular than we have a finite projective camera. There are cases when $\mathtt{M}$ is singular. These are referred to as general projective cameras.

In general camera center is the 1-dimensional right nullspace of matrix $\mathtt{P}$ since $\mathtt{P} \mathbf{C} = \mathbf{0}$.

Alternately, we can use the following to find the camera center

$$ \tilde{\mathbf{C}} = - \mathtt{M}^{-1} \left( \begin{array}{c} p_{14} \\ p_{24} \\ p_{34} \end{array} \right) $$Recall that

$$ \mathbf{C} = \left( \tilde{C}^T, 1 \right)^T $$*Aside 1*:

If $\mathtt{M}$ is singular (in case of general projective cameras), camera center (at infinity) is the right null space of $\mathtt{M}$, i.e., $\mathtt{M} \mathbf{d} = \mathbf{0}$. In this case,

$$ \mathbf{C} = \left( \mathbf{d}^T, 0 \right)^T $$*Aside 2*:

The singular value decomposition (SVD) of a matrix $\mathtt{P}$ may be written as

$$ \mathtt{P} = \mathtt{U} \mathtt{\Sigma} \mathtt{V}^T $$Then the (right) null space of $\mathtt{P}$ is the columns of $\mathtt{V}$ corresponding to singular values equal to zero.

Matrix $$ \mathtt{P} = \left[ \begin{array}{cccc} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{array} \right] $$ consits of 4 columns $$ \mathbf{p}_1 = \left[ \begin{array}{c} p_{11} \\ p_{21} \\ p_{31} \end{array} \right],\ \mathbf{p}_2 = \left[ \begin{array}{c} p_{12} \\ p_{22} \\ p_{32} \end{array} \right],\ \mathbf{p}_3 = \left[ \begin{array}{c} p_{13} \\ p_{23} \\ p_{33} \end{array} \right],\mathrm{\ and\ } \mathbf{p}_4 = \left[ \begin{array}{c} p_{14} \\ p_{24} \\ p_{34} \end{array} \right] $$

and 3 rows

$$ \begin{align} \mathbf{p}^{1T} & = \left[ \begin{array}{cccc} p_{11} & p_{12} & p_{13} & p_{14} \end{array} \right], \\ \mathbf{p}^{2T} &= \left[ \begin{array}{cccc} p_{21} & p_{22} & p_{23} & p_{24} \end{array} \right],\mathrm{\ and} \\ \mathbf{p}^{3T} &= \left[ \begin{array}{cccc} p_{31} & p_{32} & p_{33} & p_{34} \end{array} \right] \end{align} $$$\mathbf{p}_1$, $\mathbf{p}_2$ and $\mathbf{p}_3$ represent vanishing points in the $x$, $y$, and $z$ directions.

$\mathbf{p}_4$ is the image of the world origin.

*Aside*:

Homogeneous point $\mathbf{D} = (1,0,0,0)^T$ represent the $x$-axis. Then $\mathtt{P} \mathtt{D} = \mathbf{p_1}$ is the vanishing point in $x$-direction.

Figure from Hartley and Zisserman

$\mathbf{p}^{1T}$ and $\mathbf{p}^{2T}$ represent planes defined by the camera centre and image plane coordinate axis $x$ and $y$.

$\mathbf{p}^{3T}$ represents the principal plane.

Figure from Hartley and Zisserman

One way to convince ourselves that $\mathbf{p}^{3T}$ represents the principal plane is to see that the only way a point $\mathbf{X}$ maps to $(x,y,0)$ is if $\mathbf{p}^{3} \mathbf{X} = 0$. Similar arguments can be made for $\mathbf{p}^{1T}$ and $\mathbf{p}^{2T}$ representing planes defined by camera center and image plane cooridnate axis $x$ and $y$, respectively.

The principal point is given by $\mathtt{M} \mathbf{m}^{3}$, where $\mathbf{m}^{3T}$ is the third row of $\mathtt{M}$.

*Aside*:

Notice that the principal plane is given by $\mathbf{p}^{3T}$. The normal to this plane is $(p_{31}, p_{32}, p_{33})$ and using this information we can set a point at infinity using homogeneous coordinates as follows

$$ \hat{\mathbf{p}}^{3} = (p_{31}, p_{32}, p_{33}, 0) $$Now project this point using $\mathtt{P}$ and we get the principal point.

$\mathbf{v} = \mathrm{det}(\mathtt{M}) \mathbf{m}^3$ is a vector in the direction of the principal axis, directed towards the front of the camera.

Points in the world are mapped to image coordinates using

$$ \mathbf{x} = \mathtt{P} \mathbf{X} $$Points $\mathbf{D}=(\mathbf{d}^T,0)$ on the plane at infinity map to vanishing points in the image $\mathbf{x} = \mathtt{M} \mathbf{d}$

Given a point $\mathbf{x}$ in the image, we seek to determine the set of points in space that map to this point. This set of points is a ray passing point $\mathtt{P}^+ \mathbf{x}$ and the camera centre.

$$ \mathbf{X}(\lambda) = \mathtt{P^+} \mathbf{x} + \lambda \mathbf{C} $$Here $\mathtt{P}^+ = \mathtt{P}^T \left( \mathtt{P} \mathtt{P}^T \right)^{-1}$ is the psuedo-inverse of $\mathtt{P}$. Note that $\mathtt{P} \mathtt{P}^+ = \mathbf{I}$.

For finite cameras, we can develop a simpler expression. Recall that camera centre for these cameras is given by $\tilde{\mathbf{C}} = - \mathtt{M}^{-1} \mathbf{p}_4$. An image point $\mathbf{x}$ back projects to a ray intersecting the plane at infinity at the points $\mathbf{D} = \left( \left( \mathtt{M}^{-1} \mathbf{x} \right), 0 \right)^T$. Using camera centre and $\mathbf{D}$, we get

$$ \mathbf{X}(\mu) = \mu \left( \begin{array}{c} \mathtt{M}^{-1} \mathbf{x} \\ 0 \end{array} \right) + \left( \begin{array}{c} - \mathtt{M}^{-1} \mathbf{p}_4 \\ 1 \end{array} \right) $$Given a camera matrix $\mathtt{P}$, we often desire to find: camera centre, camera rotation, and camera intrinsic parameters.

We can find camera center by using $\mathtt{P} \mathbf{C} = 0$ as discussed earlier, i.e., by finding a null vector of the projection matrix.

Recall that

$$ \mathtt{P} = \left[ \mathtt{M}\ |\ -\mathtt{M} \tilde{\mathbf{C}} \right] = \mathtt{K} \left[ \mathtt{R}\ |\ - \mathtt{R} \tilde{\mathbf{C}} \right]. $$We can find $\mathtt{K}$ (camera intrisic parameters) and $\mathtt{R}$ (camera rotation) using the RQ-decomposition of $\mathtt{M}$. We require that the diagonal entries of $\mathtt{K}$ are positive to get rid of the ambiguity in the decomposition.

We can use RQ decomposition to find $\mathbf{K}$ and $\mathbf{R}$ matrices from $\mathtt{M}$

$$ \begin{align} \mathbf{x} & = \mathtt{P} \mathbf{X} \\ & = \mathbf{K} \left[ \mathbf{R}\ |\ \mathbf{t} \right] \mathbf{X} \\ & = \left[ \mathbf{K} \mathbf{R}\ |\ \mathbf{K} \mathbf{t} \right] \mathbf{X} \end{align} $$This suggest that

$$ \mathtt{M} = \left[ \mathbf{K} \mathbf{R} \right], $$where $\mathbf{K}$ is an upper-triangular matrix and $\mathbf{R}$ is an orthogonal matrix.

Use the fact that

```
FOO = np.diag(np.sign(np.diag(K)))
np.dot(FOO, FOO) = np.eye(3)
```

(This is example 6.2 from Multiple View Geometry in Computer Vision (2nd Ed.) by Harley and Zisserman)

Suppose we are given a projection matrix $\mathtt{P}$ as follows:

$$ \mathtt{P} = \begin{bmatrix} 3.53553\times10^{2} & 3.39645\times10^{2} & 2.77744\times10^{2} & -1.44946\times10^{6} \\ -1.03528\times10^{2} & 2.33212\times10^{1} & 4.59607\times10^{2} &-6.32525\times10^{5} \\ 7.07107\times10^{-1} & -3.53553\times10^{-1} & 6.12372\times10^{-1} & -9.18559\times10^{2} \end{bmatrix} $$Use $\mathtt{P}$ as given above to find the camera centre, the rotation, and the intrinsic parameters.

In [18]:

```
P = np.array([[3.53553e2, 3.39645e2, 2.77744e2, -1.44946e6],
[-1.03528e2, 2.33212e1, 4.59607e2, -6.32525e5],
[7.07107e-1, -3.53553e-1, 6.12372e-1, -9.18559e2]])
```

In [19]:

```
# %load solutions/image-formation/solution_06.py
```

In [20]:

```
M = P[:3,:3]
print(f'M = {M}')
M_inv = np.linalg.inv(M)
print(f'M_inv = {M_inv}')
p_4 = P[:,-1]
print(f'p_4 = {p_4}')
C = - np.dot(M_inv, p_4)
print(f'Center of this camera is {C}')
```

M is given, how to find K and R? Ans: RQ decomposition.

In [21]:

```
K, R = linalg.rq(M)
print(f'K = {K}')
print(f'R = {R}')
```

In [22]:

```
np.isclose(np.dot(R[:,0], R[:,1]), 0)
```

Out[22]:

In [23]:

```
np.dot(R, R.T)
```

Out[23]:

In [24]:

```
K
```

Out[24]:

In [25]:

```
np.diag(K)
```

Out[25]:

In [26]:

```
np.sign(np.diag(K))
```

Out[26]:

In [27]:

```
FOO = np.diag(np.sign(np.diag(K)))
print(f'FOO = {FOO}')
```

In [28]:

```
np.dot(FOO, FOO) # FORTRAN IDENTITY
```

Out[28]:

M = K * R = (K * FOO) * (FOO * R)

In [29]:

```
new_K = np.dot(K, FOO)
print(f'new_K = {new_K}')
```

In [30]:

```
new_R = np.dot(FOO, R)
print(f'new_R = {new_R}')
```

We have established that projective matrix

$$ \mathtt{P} = \left[ \begin{array}{cccc} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{array} \right] $$encodes useful properties about camera. This matrix controls the imaging process: $ \mathbf{x} = \mathtt{P} \mathbf{X}. $

We can also use this matrix to back project a point in the image, which can be used to perform metric scene analysis. The question remains, however. Given a camera, how do we actually find this matrix? The process of finding $\mathtt{P}$ or its variant components is referred to *Camera Calibration*.

A detailed treatment of this subject is left as a self-study exercise. Here we present the basic idea. Say we have a number of point correspondences $\mathbf{X}_i \longleftrightarrow \mathbf{x}_i$ between 3D points $\mathbf{X}_i$ and 2D image points $\mathbf{x}_i$. We are interested in finding the camera matrix $\mathtt{P}$ such that for all $i$

$$ \mathbf{x}_i = \mathtt{P} \mathbf{X}_i. $$With some patience, we can re-arrange the above as follows

$$ \left[ \begin{array}{ccc} \mathbf{0}^T & -w_i \mathbf{X}_i^T & y_i \mathbf{X}_i^T \\ w_i \mathbf{X}_i^T & \mathbf{0}^T & - x_i \mathbf{X}_i^T \\ - y_i \mathbf{X}_i^T & x_i \mathbf{X}_i^T & \mathbf{0}^T \end{array} \right] \left( \begin{array}{c} \mathbf{p}^1 \\ \mathbf{p}^2 \\ \mathbf{p}^3 \\ \end{array} \right) = \mathbf{0}. $$Here each $\mathbf{p}^{iT}$ is a 4-vector, the $i$-th row of $\mathtt{P}$. The three equations are linearly dependent, so we often just two the first two. Specifically, we use

$$ \left[ \begin{array}{ccc} \mathbf{0}^T & -w_i \mathbf{X}_i^T & y_i \mathbf{X}_i^T \\ w_i \mathbf{X}_i^T & \mathbf{0}^T & - x_i \mathbf{X}_i^T \end{array} \right] \left( \begin{array}{c} \mathbf{p}^1 \\ \mathbf{p}^2 \\ \mathbf{p}^3 \\ \end{array} \right) = \mathbf{0}. $$From a set of $n$ point correspondences, we obtain a $2n \times 12$ matrix. Recall that $\mathtt{P}$ has 11 degrees of freedom, so we just need 11 equations (5 and a half points; we only need either $x$ or $y$ for the 6-th correspondence).

We often use more than 5 and half points to compute $\mathtt{P}$. In this case the goal is to minimze $\| \mathtt{A} \mathbf{p} \|$ subject to a suitable constraint. A possible choice is for this constraint is $\| \mathbf{p} \| = 1$.

*Aside*:

We will return to camera calibration at the end of this discussion.

- A special case of perspective projection
- Center of projection (optical center) is at an
*infinite*distance away from the image plane

- Object dimensions are much smaller compared to its distance from the camera, e.g., satellite imagery

Pushbroom cameras is an abstraction for the type of sensors often used in satellite imaging (SPOTS). Here camera is a linear sensor array that captures one row of image at each time. The full image is constructed by moving the camera. When sensor moves, the sensor plane sweeps out a region of space.

Let $\mathbf{X}$ be a point in space, and lets assume that camera matrix is $\mathtt{P}$. Suppose that $\mathtt{P} \mathbf{X} = (x, y, w)^T$. Then the corresponding inhomogeneous image point is $(x, y/w)$. This assumes that the camera motion is along $x$ direction of the image.

- Distant objects appear smaller.
- The same object when imaged using different focal lengths appear to have different sizes.
- Parallel lines meet.
- Angles are not maintained, i.e., perpendicular lines appear meet at angles other than 90 degrees.

Each set of parallel lines meet at a different point on the image plane. This point is called the

*vanishing point*for this direction.Set of parallel lines in the same plane lead to *collinear vanishing points. The lines formed by these vanishing points is called that horizon for that plane.

Vanishing points can be used for 3D scene analysis

These can be used to spot fake images, e.g., scale or perspective doesn't work

Is this a perspective image of four identical buildings? If not, why not?

Images captured by pinhole cameras are generally two dark. In order to allow more light to pass through and reach the sensor plane, we need to increase the size of the pinhole. This has the effect of making the image blurry. We can undo this by using a lens that will focus the light onto the image plane. Lenses, however, introduce radial and chromatic distortions.

This discussion follows Ch. 7 of Multiple View Geometry in Computer Vision, Second Edition.

- $(\tilde{x}, \tilde{y})$ is the ideal image position (which obeys linear projection)
- $(x_d, y_d)$ is the actual image position (after radial distortion)
- $\tilde{r}$ is the radial distance $\sqrt{\tilde{x}^2 + \tilde{y}^2}$ from the center
- $L(\tilde{r})$ is the distortion factor (we make a simplifying assumption that the distorion is a function of radius only.

Then we can write

$$ \left( \begin{array}{c} x_d \\ y_d \end{array} \right) = L(\tilde{r}) \left( \begin{array}{c} \tilde{x} \\ \tilde{y} \end{array} \right) $$In pixel coordinates, the correction is written as

$$ \begin{align} \hat{x} & = x_c + L(r) (x - x_c) \\ \hat{y} & = y_c + L(r) (y - y_c) \end{align} $$Here $(x, y)$ are the measured coordinates, $(\hat{x},\hat{y})$ are the corrected coordinates, $(x_c, y_c)$ is the center of radial distorion, and $r = \sqrt{(x-x_c)^2 + (y-y_c)^2}$.

We often use Taylor expansion $L(r) = 1 + \kappa_1 r + \kappa_2 r^2 + \cdots$. The coefficient of radial distortions $\{\kappa_1, \kappa_2, x_c, y_c \}$ are considered part of the intrinsic matrix $\mathtt{K}$. These paramters are often estimated during camera calibration process. $L(r)$ can be estimated by minimizing the deviation from linear mapping, e.g., the distances of end-points of a line to its mid-point.

For further information, we refer the reader to Ch. 7 of Multiple View Geometry in Computer Vision (2nd Ed.) by Harley and Zisserman.

Check out the camera-calibration notbook.

In [ ]:

```
```