How Do Vision Transformers See Depth in Single Images?

Investigating the visual cues that vision transformers use for monocular depth estimation. This is an extension of an analysis originally done for convolutional neural networks.


Monocular depth estimation is a task in Computer Vision, where a depth map is estimated from a single input image. This differs from approaches like SLAM or Structure-from-Motion, where multiple camera images can be used to estimate the geometry in the scene. It is often assumed that monocular depth estimation algorithms, similar to human vision, make use of so-called pictorial cues like the apparent size of a known object or its position in relation to the horizon line .

Van Dijk and de Croon published a paper that investigated the visual cues that neural networks exploit when trained for monocular depth estimation . The authors generated images using the KITTI benchmark that tested specific visual cues . The trained neural networks were then evaluated on their monocular depth estimation performance on these generated images.

Van Dijk and de Croon’s paper is fairly unique in its in-depth analysis of intentionally constructed images to infer the inductive biases of the neural networks. This type of analysis lends itself well for image-based approaches, since the manipulation of annotated images to construct synthetic example images with specific properties is fairly easy.

Very recent advances in Natural Language Processing make use of the transformer architecture , which has led to an adoption of transformers for the Computer Vision domain . Novel methods for monocular depth estimation like DPT , NeWCRFs and BinsFormer make use of vision transformers, an architectual design not found in the four neural networks that Van Dijk and de Croon analzyed: MonoDepth , SfMLearner , Semodepth and LKVOLearner The original publication from Van Dijk and de Croon focuses mainly on their observations on MonoDepth, but the available source code includes scripts to analyze SfMLearner, Semodepth and LKVOLearner:

In this blog post we will extend Van Dijk et. al. original experiments to include the transformer-based architecture DPT for monocular depth estimation. We will see if this type of black box analysis, that tests specific visual cues and the network’s responses can help us understand the inductive biases of vision transformers.


Pictorial Cues

Van Dijk and de Croon’s original publication initially reviews literature on pictorial cues that humans use when estimating distance. These pictorial cues could then be used as a starting point to conduct specific tests with the KITTI data . Only a subset of these cues are applicable to the KITTI dataset, while the others are unlikely to appear in the training set. This can be due to the low image resolution in the KITTI dataset preventing the visual cue from appearing often enough within the training set. While other cues do not apply for estimating ones distance to obstacles. Please review the original paper for a more in-depth analysis of the pictorial cues observed from studying human vision.

The pictorial cues that Van Dijk and de Croon initially focus on are the apparent size of objects and their position in the image. The apparent size of objects relies on the observation that objects further away appear smaller. This pictorial cue relies on having a certain understanding about what type of obstacle appears in the image and knowing its typical size.

The later experiments on changing the camera pose are motivated from the intuition that the neural networks implicitly learn a certain camera pose during training. In the KITTI dataset, this would be a camera pose that points parallel to the ground surface and has a fixed height above ground, where the camera is mounted in the vehicle. The performance of the neural networks after pitching and rolling the camera has interesting implications on the robustness of using the network with a different camera setup.

The final experiments focus on the visual cues that could be affected by the visual appearance of the obstacle in the image. Here the performance of the neural networks for arbitrary obstacles that don’t appear in the KITTI training set were tested. The presence of a shadow unter the object also appears to have a strong effect on the detection performance.

Vision Transformers

Figure 1: This architectural overview of DPT shows how the initial image data is treated as patches that are passed to the Transformer blocks. The Reassemble and Fusion blocks extend the original vision transformer for dense prediction tasks like monocular depth estimation.

Vision transformers make use of the transformer architecture that has shown success in Natural Language Processing and other sequence modeling tasks . The transformer encoder blocks consists of two layers: the first is a multi-head self-attention mechanism and the second is a multilayer perceptron. Around each of the two layers a residual connection is used, followed by a layer normalization .

The original transformer expected a sequence of tokens as input. The vision transformer converts non-overlapping patches of the input image into a flattened representation to resemble tokens. An additional positional embedding is added to the token to provide information to the network about the position of each token in the sequence (see Figure 1). DPT also adds a patch-independent readout token, that intends to add a global image representation. In particular, we are evaluating the performance of DPT-Hybrid, which uses a ResNet-50 feature extractor to create embeddings.

The treatment of the image as a sequence of patches and each patch as a flattenend linear projection of the image patch could have disadvantages for image data. By flattening the image patch to a one-dimensional token you lose information about neighborhood pixels within the patch. The position embedding only provides a relationship between the patches on a higher level. This leads to the conclusion that vision transformers lack an inductive bias in modeling local visual structures . Convolutional neural networks instrinsically model locality by learning weights for convolutional kernels. The convolutional kernels inherently only consider neighboring pixels in the input data.


In the following experiments we want to present the observations made by van Dijk and de Croon in their original work and our observations made using vision transformers. This format as a blog post with interactive elements lends itself well to revisit the original results from a new perspective.

Position vs. Apparent Height

Figure 2: The apparent height h of this car in the camera image and its vertical position y in the camera image could be pictorial cues to estimate the true object height H and its true ($Y_{obj}$, $Z_{obj}$ ) position in the camera coordinate frame. This figure is taken from the original publication .

The distance of an object could geometrically be estimated by two different approaches from a single image (see Figure 2). Both approaches assume a known and constant focal length $f$, which a neural network implicitly learns when trained on a dataset taken with the same camera setup. The first approach estimates the object’s real-world height $H$ and compares it to its apparent height $h$ on the image plane.

\[Z_{obj} = \frac{f}{h} H\]

This approach requires a good estimate of the real-world height $H$ of the objects present in the KITTI dataset. The limited number of obstacle classes (cars, trucks, pedestrians) in the KITTI dataset all share roughly the same height within each class. It is therefore plausible for the neural network to have learned to recognize these objects and use their apparent height as estimate for the distance. This approach relies only on the scale of the object for the distance estimate.

The other approach requires the additional assumption of a flat road surface. This assumption is valid in many urban environments. For this approach we look at the vertical position $y$ in the image, where the bottom part of the object intersects with the flat ground surface. We will call it the ground contact point from now on. The ground contact point can be compared in relation to the camera’s height $Y$ over the ground surface to estimate the distance:

\[Z_{obj} = \frac{f}{y} Y\]

Most scenes in the KITTI dataset are recorded on a flat road surface and the camera’s height above ground $Y$ remains constant across the dataset. Making a detection based on the position of the ground contact point $y$ in the image is also within the KITTI dataset. This approach relies only on the position of the object for the distance estimate.

Van Dijk and de Croon created a set of test images to estimate the effect of the scale and position of obstacles on the depth estimate accuracy. The test images consist of objects (mostly cars) that are cropped from different KITTI images and placed into open spaces of other images in the KITTI dataset. These crops are only done on images, where the scene remains plausible after the insertion.

The accuracy of the depth estimate is measured by the relative distance of the new object $Z_{new}$ from the distance the object originally appeared in $Z_{new} / Z = 1$. The relative distance is increased in 0.1 steps up to $Z_{new} / Z = 3$ The relative distance increase is done in three different manners (see Figure 3):

  1. Position and Scale: here we change the ground contact point and the scale of the object when increasing the relative distance
  2. Position Only: here we only change the ground contact point and keep the original scale of the object
  3. Scale Only: here we keep the original ground contact point fixed and change the scale of the object
Figure 3: Van Dijk and de Croon generated synthetic images from the KITTI Dataset, where an original obstacle (e.g. pedestrian, car, van) was placed in a new scene at different distances. They removed the scaling and change in vertical position to evaluate the importance of these components as visual cues for the depth estimation. Here is the same scene with both changes in its position and scaling, only changes in its position and no scaling, and only scaling and no changes in its position.

Comparing the predicted depths of both neural networks on these different objects in these 3 configurations should give us an idea how much both of these visual cues are used for the monocular depth estimation (see Figure 4).

Figure 4: We compare the relative distance estimate for objects that were pasted into KITTI scenes. The distance of the objects to the camera is increased and we compare if the estimate grows at the same rate. We distinguish between the two visual cues of the object scale and its position in the image. By keeping one of the two components fixed, we can observe how important that visual cues is for an accurate distance estimation. The experiment shows that using both visual cues of the position and scale are required for an accurate distance estimation. The scale on its own does not provide sufficient information for an increase in depth for both netowrks. Hover over the lines for a detailled description of the charts.

Camera Pose

The neural networks require some knowledge about the camera’s height above the ground surface and its pitch to estimate depth towards obstacles based on the vertical image position. The camera in the KITTI dataset is fixed rigidly to the car. The only deviations in its orientation come from pitch and heave motions of the car suspensions or from slopes in the terrain, which do not often occur in the urban environment of the KITTI dataset.

Van Dijk and de Croon propose two different approaches on how the camera pose is estimated by the neural networks.

  1. Fixed Camera Pose: here the neural networks assume a fixed camera pose in all images. The fixed camera pose allows the neural network to directly infer the depth towards an obstacle from the vertical image position. The assumption of a fixed camera pose across all images prevents the trained network from being directly transfered to another vehicle with a different camera setup.
  2. On-the-fly Pose Estimation: here the neural networks make use of visual cues like vanishing points or horizon lines to estimate its camera pose. A vertical shift of the horizon line in the image can give an indication on a change in pitch.

We repeat the following experiment, that was proposed by Van Dijk and de Croon to determine the effects of a change in pitch. This expirement should provide use an indication how much the neural networks assume a fixed camera pose or use visual cues like the horizon to estimate he camera pose on-the-fly for each input image. For this we take a smaller crop of the original KITTI image at different heights. From the image center, the area of interest in the crop is shifted by $\pm 30$px vertically from the image center. This translated to an approximate change in camera pitch of $\pm 2$-$3$ degrees.

We then compare how much the horizon shift in the predicted depth map corresponds with the true horizon shift produced by shifting the vertical position of the cropped image (see Figure 5). The horizon line in the predicted depth map is determined by fitting a line along disparity values of zero (i.e. infinite distance) along the central region of the KITTI images. If the predicted horizon line is shifted as expected with the synthetic change of pitch, then the neural network most likely does not just learn an absolute position for the horizon line in the input images.

Figure 5: We evaluate the change in position of the horizon lines in the predicted depth map when shifting its position in the crop of the image. The variation in the predicted horizon position grows for both networks as the true shift in pixels differs from its original position. DPT can more accurately detect the change in pitch in general compared to MonoDepth. Hover over the lines in the graph for a more detailed explanation.

We can observe in the previous figure that DPT can better account for the change in pitch and its effect on the horizon line in the cropped image data than the original MonoDepth could. DPT was trained on a larger meta-dataset called MIX 6 before being fine-tuned on the KITTI dataset. MIX 6 includes images taken from different camera setups, which could be a contributing factor on the robustness of PDT on changes in pitch. It is therefore not clear if the superior performance of DPT is mainly caused by its vision transformer network architecture. The larger training set used by DPT could be a determining factor.

Obstacle Recognition

The previous experiments have shown that the vertical position of objects in the image is an important visual cue for depth estimation. The vertical position can be determined by finding the object’s ground contact point. This does not require additional knowledge about the type of object, suggesting that the depth estimate can be done for arbitrary objects.

Van Dijk and de Croon did a qualitative analysis on how well MonoDepth could detect obstacles that were uncommmon in the KITTI dataset. We compared DPT’s performance on the same task (see Figure 3 below).

Figure 6: Van Dijk and de Croon cropped in a fridge in the center of an image from the KITTI dataset and observed that MonoDepth (2nd column) only partially detects the lower half of the obstacle. By adding a shadow at the ground contact point of the fridge (see the 2nd row), the detection performance for MonoDepth improves. This indicates that the dark shadow along the ground surface is a visual cue that MonoDepth uses for its depth estimation. This is similar to the first approaches on vision-based object recognition DPT can reliably detect the fridge and does not have a noticeable improvement when the shadow is added (3rd column). You can use the zoom tools (, ) and pan tools () when moving your mouse to the top right corner of the figure. It will manipulate all figures in parallel to allow for an easy comparison of the prediction maps.

Cropping out parts of familiar obstacles can provide some insight on the visual cues that more strongly lead to the detection of the object by the neural networks. Here we also look into the DPT’s performance on the same input images (see Figure 7 below).

Figure 7: Van Dijk and de Croon took a crop of a car from the KITTI dataset and pasted parts of the car into a new scene. DPT generally performs better in detecting the vertical parts of the car as an obstacle. Both networks struggle in detecting the horizontal edge of the car roof as an obstacle (see the 3rd and 4th row). It seems that the vertical part of the car on its own is mistaken as part of the background, but is associated to the car, when the car is completely visible. DPT is able to detect the bottom part of the car as an obstacle and estimates a distance similar as for the full car prediction (see the 1st and 3rd row). If only given the inner section of the obstacle, then path networks estimate the obstacle to be about $2$-$4$m further away than the full car (see the 4th row and hover over the depth map to read the estimated distance for a given pixel). This could be related to the ground contact point being higher in the image if the bottom section of the car is excluded.


Experiment Results

In this blog post we reproduced the original results on MonoDepth and extended them with experiments on the transformer-based neural network DPT. We generally observed an improved performance with the more recent DPT network.

DPT still uses the ground contact point as an important visual cue for the depth estimation of obstacles, but is less reliant on it as the original MonoDepth was. For DPT the scale of the known object also contributes for a more accurate depth estimation.

DPT can better handle changes in camera pitch than the original MonoDepth. We evaluated this on a DPT model that was pre-trained on the larger and diverse set of depth images than the original MonoDepth. So this could be the determining factor here instead of the difference in network architecture between transformer-based DPT and CNN-based MonoDepth.

DPT is able to detect unusual objects (e.g. fridge) as obstacles in the scene even if the drop shadow as visual cue is absent. MonoDepth heavily relies on the dark border on the ground surface for obstacle detection. Thin vertical obstacles (e.g. parts of the car obstacle) are better detected by DPT than MonoDepth. Here again, the diversity in the pre-training data for DPT could explain the more robust detection of unusual obstacles.

For both models, we observed the issue of detecting a thin horizontal obstacle (e.g. not detecting the top part of the car). This is a known issue in stereo vision caused by the often horizontal alignment of the camera pairs , but this does not have to hold for monocular depth estimation. A more thorough investigation is required to determine if this is a common issue for monocular depth estimation by neural networks.

We see a lot of value in the analysis of neural networks based on these type of experiments. The analysis can easily be extended to more neural networks and could be part of the evaluation dashboard used during architecture development and training evaluation. There are of course issues in this black box view on each of these networks . The goal should also be to create interpretable networks for monocular depth estimation .

Comments on Blog Format

We believe, that this interactive blog format lends itself well to allow a reader a better interaction with the results obtained by different algorithms. The data can be condensed in a form to take up less screen space and be presented as part of the main text instead of being made part of the supplementary material.

Modern plotting libraries like Plotly allow for HTML exports, which make the process to creating interactive visualizations for the web fairly easy. We see this approach to web publishing as a middle road between the now common large PDF publications with many embedded images and the time consuming process of creating visualizations for platforms like the former Distill .