🔗 Linking 2D and 3D#

Linking 2D and 3D modalities is key in various tasks enabled by our dataset, including the Grounded Compositional Recognition we propose.

We show here how to link the 2D and 3D modalities using the camera parameters and depth maps we provide in the dataloaders.

We provide the camera parameters as an \(8 \times 4\) tensor, where the upper \(4 \times 4\) matrix is the intrisic matrix \(K\), and the lower matrix is the extrinsic matrix \(M\).

from demo_utils_3D import *
from demo_utils_2D import *
import utils3D.plot as plt_utils_3D

ROOT_DIR_3D = "./3D_PC_data/"
ROOT_DIR_2D = "./2D_data/"

Projecting 3D pointclouds to the 2D image plane#

We illustrate here how to make use of the camera parameters to project 3D pointclouds to the 2D image plane.

Dataloader for the full data available in the WDS shards.
Adapt and filter to the fields needed for your usage.

    ...:               See FullLoader
    loader_3D:         The 3D loader to fetch the 3D data from.
from compat2D_3D import FullLoader2D_3D

valid_loader_2D_3D = (
    .make_loader(batch_size=1, num_workers=0)

# Unpacking the huge sample
for data_tuple in valid_loader_2D_3D:
    shape_id, image, target, \
    part_mask, mat_mask, depth, \
    style_id, view_id, view_type, \
    points, points_part_labels, points_mat_labels = data_tuple

We first display the pointcloud and the rendered image, separately.

# Plotting the 3D pointcloud
points_xyz = points[:, :3]
points_col = points[:, 3:]
show_tensors([image], size=4)

\(K\) and \(M\) are stacked vertically in the camera parameters tensor we provide. You can recover them by doing:

K, M = cam_parameters.chunk(2, axis=0)

We then project the 3D pointcloud xyz coordinates into the 2D plane, using \(K\) and \(M\). This is done simply via:

\( p_{xy} = KMp_{xyz} \)

Where both \(p_{xyz}\), \(p_{xy}\) are homogeneous point coordinates.

def project_to_2D(pointcloud, proj_matrix):
    Project 3D pointcloud to 2D image plane.
    pc = np.concatenate([pointcloud, np.ones([pointcloud.shape[0], 1])], axis=1).T

    # Applying the projection matrix
    pc = np.matmul(proj_matrix, pc)
    pc_final = pc/pc[2]

    return pc_final.T

def viz_projected_points(p, image, K, M):
    Project 3D points to 2D image plane and visualize.
    # Create a new figure and axis
    _, ax = plt.subplots()

    # Display the rendered image
    im = np.array(torch.squeeze(image).permute(1, 2, 0))
    im = im/np.amax(im)

    # Computing projection matrix from camera parameters
    proj_matrix = K @ M

    # Project the points to 2D
    p_xyz = p[:, :3]
    p_col = p[:, 3:]/255.
    p_xyz[:, 2] -= np.min(p_xyz[:, 2]) # Translate points to be above the rendering plane
    p_xyz = project_to_2D(p_xyz, proj_matrix=proj_matrix)

    # Sample 4000 points for display

    # Show the figure

K, M = cam_parameters[0].chunk(2, axis=0)

# Visualize the projected points
viz_projected_points(np.array(points), image, K, M)

Transform 2D pixels to 3D space#

Using depth maps and camera parameters, we can also transform our 2D RGB renders into the 3D space. We highlight here how to do that using the same dataloader.


Values from the far clipping plane (any orthogonal distance greater than 10) have to filtered to properly make use of depth maps.

import numpy as np

def img_to_pc(img, depth_map, K, M):
    Convert an image and its depth map to a point cloud.
    depth_map = depth_map.numpy()
    img = img.permute(1,2,0).numpy()
    K = K[:3,:3].numpy()

    # Generate the pixel grid
    h, w, c = img.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Compute the camera coordinates of the points
    fx, fy, cx, cy = K[0,0], K[1,1], K[0,2], K[1,2]
    X_cam = -np.stack([(u - cx) / fx, (v - cy) / fy, np.ones((h, w))], axis=-1) * depth_map[..., np.newaxis]
    X_cam = X_cam.reshape((-1, 3))

    # Apply camera extrinsics
    R, t = M[:3,:3], M[:3,3:]
    X_world = np.dot(R, X_cam.T) + t.reshape((-1, 1))
    X_world = X_world.T
    # Add color information to points
    points = np.hstack((X_world, img.reshape(h*w, c)))

    # Reshape the points array
    points = points.reshape((h * w, 3+c))
    # Filter depth values from the far clipping plane
    mask = (depth_map.flatten() < 10)
    points = points[mask]

    return points

By default, the \(M\) matrix is world-to-camera, as shown in the previous section. We must invert it before transforming points to 3D space.

M = np.linalg.inv(M)
points = img_to_pc(image[0], depth[0], K, M)

The transformed RGBD image can then be visualized:

p_xyz = points[:, :3]
p_col = points[:, 3:]

We augment the resulting pointcloud with a small additive Gaussian noise for visualization, in the interactive view below.

p_xyz_aug, p_col_aug = aug_pc(p_xyz, p_col, shift=1e-3, exp_steps=8)
p_xyz_aug = rotate_pc(p_xyz_aug)
plt_utils.show_pointcloud(p_xyz_aug, p_col_aug).show(height=300)