Image Formation and the Thin Lens Equation
Learn the basics of image formation and the thin lens equation.
Overview
Before we even consider how to assemble 3D scenes, we should first consider how we observe them via 2D renders. Much like a person sees the world through their eyes (and often a pair of glasses), we observe 3D data through 2D renders. Rendering 3D scenes is a complex subject, so we begin with the physics of how real-world images are formed by light.
Images
Images need no introduction. Most of us carry cameras in our pockets these days and capture images with the same ease with which we read and write. However, for pedagogical purposes, let’s define in strict terms what an image is.
For our purposes, an image is simply a regularly spaced 2D grid composed of many rectangular bins that each contain a record of light. When a digital camera takes a photo, rays of light pass through the camera lens and strike a sensor. This sensor has a regular spacing of photosensors. These photosensors, called pixels, gather charge when receiving light, allowing the camera to estimate how much light struck a particular sensor while the photo was taken. Think of the raw value at each pixel of the 2D image as an estimate of how much light was recorded at that part of the sensor. In the case of RGB images (i.e., color images), pixels record light across three separate wavelengths: red (R), green (G), and blue (B). Images with only a single value for intensity are called grayscale images.
In the numpy or torch libraries, an image is most often represented as a (possibly batched) 3D tensor with the shape (H, W, C), or (N, H, W, C) when batched, where:

- H is the number of rows (height)
- W is the number of columns (width)
- C is the number of channels (e.g., 3 for RGB and 1 for grayscale)
- N is the (optional) batch size

For instance, when we say “one RGB image with dimensions H × W,” we mean a tensor of shape (1, H, W, 3).
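As a quick sketch of this convention (the array contents here are arbitrary placeholder values):

```python
import numpy as np

# One 4x6 RGB image, stored as a batched (N, H, W, C) tensor.
# A real image would hold measured intensities; zeros are placeholders.
image = np.zeros((1, 4, 6, 3), dtype=np.uint8)

n, h, w, c = image.shape
print(n, h, w, c)  # 1 4 6 3
```

Note that some libraries (e.g., torch's convolution layers) instead expect a channels-first (N, C, H, W) layout, so it's worth checking which convention your tools assume.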
Pinhole camera model
Our theory of image formation will be our bridge between the 2D space of an image and the 3D space of the observed world. As such, having a model of image formation is crucial to 3D machine learning. The simplest model for understanding image formation is the so-called pinhole camera model. Though it does not account for real-world camera effects like distortion, it provides a remarkably intuitive geometric model of how light moves through space to produce the images we capture.
Many of us have made a pinhole camera at home before. It usually consists of a dark chamber, such as a box with a flat surface on one end and, on the opposite end, a tiny hole called the aperture through which light can pass. A so-called camera obscura is one type of pinhole camera. As light passes through the aperture, it strikes the flat surface on the other end. The narrow diameter of the aperture ensures that light can strike each point on the flat surface from only a specific angle. The effect is that we see a projection of the outside world cast upon this surface, and thus we refer to it as the image plane.
In the figure above, rays of light from the scene pass through the aperture and strike the image plane, producing an inverted projection of the scene.
As mentioned previously, the great thing about the pinhole camera model is that it gives us a geometric understanding of our scene. From any point in our 2D image plane, we can draw a ray that passes through the aperture and into our 3D scene. Anything along that ray is a potential source of reflected or emitted light for our render. This is, in fact, how ray-based rendering works, and it also gives us, in computer vision, tools for learning about a 3D scene from 2D images.
Note how we can relate distances on one side of the aperture to the other (e.g., outside to inside of the camera or vice versa) via similar triangles.
In other words, the ratio between an object's height and its distance from the aperture equals the ratio between its projected height and the image plane's distance from the aperture: h_o / d_o = h_i / d_i.
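To make the similar-triangles relationship concrete, here is a minimal sketch (the function and variable names are our own):

```python
def projected_height(h_o: float, d_o: float, d_i: float) -> float:
    """Project an object's height through a pinhole aperture.

    By similar triangles, h_i / d_i = h_o / d_o, so h_i = h_o * d_i / d_o.

    Args:
        h_o: Object height.
        d_o: Distance from the aperture to the object.
        d_i: Distance from the aperture to the image plane.
    """
    return h_o * d_i / d_o

# A 2 m tall object 10 m from the aperture, with the image plane
# 0.05 m behind it, projects to a 0.01 m (1 cm) tall image.
print(projected_height(h_o=2.0, d_o=10.0, d_i=0.05))  # 0.01
```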
The thin lens equation
In many cases, we can’t ignore the effects of the lens entirely. For instance, we’ll often encounter situations where focus and exposure are important to consider. In cases like these, we can apply the most basic of lens models: the thin lens equation. A lens is simply a piece of glass, often curved on both sides, which is inserted into the aperture of a camera. A so-called thin lens is one where the thickness of the glass is negligibly small compared to the radii of curvature of either side of the glass.
This lens focuses the light passing through it onto the image plane. Because of the geometry of the light passing through the lens, only light from objects at a certain distance from the aperture will reach the image plane in tight focus. In the figure above, each beam of light passes through the lens, changes direction via refraction, and converges toward a single point; only beams originating at the in-focus distance converge exactly on the image plane.
The resulting effect is that light from outside this region appears blurry and out of focus. Photographers refer to this region of focus as the depth of field. Light that arrives from outside of this region will be distributed in a blur circle known as the circle of confusion, a name that hints at the ambiguous appearance that this unfocused light causes. This is precisely why out-of-focus portions of an image appear as if their details are blurred in a radius around their true position.
The thin lens equation provides the relationship between the focal length and the in-focus distances, telling us which parts of the world will be in focus and which will be out of focus. This phenomenon is why photographers have to adjust the focus of their cameras to keep objects looking clear. Photographers often utilize this effect to create visually stunning portraits and macro photography.
The thin lens equation relates the following properties:

- Focal length f: The distance between the aperture and the focal point.
- Image distance d_i: The distance between the aperture and the image plane.
- Object distance d_o: The distance between the aperture and the observed object.

The equation itself states that 1/f = 1/d_o + 1/d_i.
Try to use the thin lens equation to estimate the optimal object distance for a given focal length and image distance.
```python
def thin_lens(f: float, di: float) -> float:
    """Apply the thin lens equation to solve for the object distance.

    Args:
        f: Focal length
        di: Distance between the aperture and image plane
    """
    return 0.0
```
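One way to complete the exercise (try it yourself first): rearranging 1/f = 1/d_o + 1/d_i gives d_o = f * d_i / (d_i - f). A possible solution sketch:

```python
def thin_lens(f: float, di: float) -> float:
    """Apply the thin lens equation to solve for the object distance.

    Rearranging 1/f = 1/d_o + 1/d_i gives d_o = f * d_i / (d_i - f).

    Args:
        f: Focal length
        di: Distance between the aperture and image plane
    """
    return (f * di) / (di - f)

# With a 50 mm focal length and the image plane 60 mm behind the lens,
# objects 0.3 m away are in perfect focus.
print(thin_lens(0.05, 0.06))
```

Note that the expression diverges as d_i approaches f: light arriving from infinitely far away converges exactly at the focal length.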
While these models are simplistic and don’t account for many real-world effects, we’ll see that we can still use them effectively in 3D machine learning, often by using other techniques like camera calibration to coax our data so that it fits these models.