Abstract
Among the many 3D representations, coordinate-based implicit neural networks, or neural fields, have recently gained much attention in 3D computer vision for their ability to represent shape and appearance with very high fidelity. Despite these advances, however, it remains challenging to build generalizable neural fields for a category of objects without datasets such as ShapeNet that provide “canonicalized” object instances, i.e., instances consistently aligned in 3D position and orientation (pose). Aligning objects in 3D improves generalization in tasks such as 3D scene understanding, classification, and segmentation, and 3D pose estimation can also be obtained as a by-product of alignment. Methods exist that align 3D objects represented as point clouds or meshes. With neural fields emerging as a promising implicit 3D representation, a method is needed to align them so that the benefits already enjoyed with point clouds and meshes carry over. Unlike point clouds and meshes, however, neural fields are parameterized by deep neural networks, which are hard to interpret. In this thesis, we present Canonical Field Network (CaFi-Net), a self-supervised method to canonicalize the 3D pose of instances from an object category represented as neural fields, specifically neural radiance fields (NeRFs).
Neural fields, specifically NeRFs, describe a 3D scene as a function of density and view-dependent color. Since aligning the objects of a category depends on geometry rather than color, CaFi-Net uses density alone to align instances within a category. Canonicalization is tightly coupled with equivariance: drawing inspiration from 3D equivariant networks, we construct CaFi-Net as a rotation-equivariant network. The network learns directly from continuous, noisy density fields using a Siamese architecture. Previous work has done this for point clouds, but handling fields, specifically vector fields, requires rotation equivariance in both the positions and the orientations of the field. To incorporate rotation equivariance over fields, we use the gradient of the scalar density field, which is a vector field, as the input signal to CaFi-Net. Spherical harmonics serve as the basic building block of CaFi-Net's equivariant convolution kernels. To handle noise, features are weighted by the density value at each point, and density-based clustering separates foreground from background, which is used when computing the losses. Since no public dataset is available for training CaFi-Net, we created a simulator that renders 54 omnidirectional camera views for 1300 NeRF instances across 13 ShapeNet object categories.
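As an illustration of this signal construction, the following is a minimal sketch, assuming a hypothetical differentiable `density_fn` standing in for a trained NeRF density network; the sampling, weighting, and function names are illustrative rather than the thesis implementation.

```python
import torch

def gradient_signal(density_fn, points):
    """Density gradient (a vector field) and density weights at 3D points.

    `density_fn` is any differentiable map from (N, 3) points to (N,)
    densities; this is an illustrative sketch, not the CaFi-Net code.
    """
    points = points.clone().requires_grad_(True)
    sigma = density_fn(points)                    # (N,) scalar density field
    # Gradient of density w.r.t. position: rotating the input rotates both
    # the sample positions and these gradient vectors (rotation equivariance).
    grad_sigma, = torch.autograd.grad(sigma.sum(), points, create_graph=True)
    # Weight features by the density at each point to suppress noisy,
    # near-empty samples (the exact weighting in the thesis may differ).
    weights = sigma.detach().unsqueeze(-1)        # (N, 1)
    return grad_sigma, weights

# Toy usage: a Gaussian blob stands in for a NeRF density field.
density_fn = lambda x: torch.exp(-(x ** 2).sum(-1))
pts = torch.rand(1024, 3) * 2.0 - 1.0             # samples in [-1, 1]^3
grad_field, w = gradient_signal(density_fn, pts)
```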
During inference, our method takes pre-trained neural radiance fields of novel object instances at arbitrary 3D pose and estimates a canonical field with a consistent 3D pose across the entire category. Since no canonicalization metrics exist for neural fields, we evaluate CaFi-Net with the metrics used for point clouds. In addition, we introduce a new metric, Ground Truth Equivariance Consistency (GEC), which measures canonicalization performance against manual labels. Extensive experiments on the above dataset of 1300 NeRF models show that our method matches or exceeds the performance of 3D point-cloud-based methods. Ablation studies examine the choice of input signal, the weighting of equivariant features by density, and the need for the Siamese network, justifying the design choices of CaFi-Net. In the results, we show renderings of the neural fields from the canonical pose that are consistent across each category.
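To make the notion of a canonical field concrete, here is a minimal sketch of how an estimated canonicalizing rotation can be applied to any field: the canonical field is obtained by querying the original field at inversely rotated coordinates. The function names are hypothetical and the rotation is assumed to come from the network; this is not the thesis implementation.

```python
import torch

def canonicalize_field(field_fn, R):
    """Wrap `field_fn` so it is evaluated in the canonical frame.

    Rotating a field by R means querying the original field at R^T x:
        f_canon(x) = f(R^T x)
    `field_fn` maps (N, 3) points to field values (e.g., density) and R is a
    (3, 3) rotation matrix; an illustrative sketch, not the CaFi-Net code.
    """
    def canonical_fn(points):
        # Row-vector convention: each row x^T becomes x^T R = (R^T x)^T.
        return field_fn(points @ R)
    return canonical_fn
```

Rendering such a canonical field from a fixed set of camera poses then yields views that are consistent across the instances of a category.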