¶ FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
FoundationPose is a unified foundation model for 6D object pose estimation and tracking of novel objects. It was introduced in the CVPR 2024 paper FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects by Wen, Yang, Kautz, and Birchfield.
Code, model weights, demo assets, and implementation details are publicly available in the official GitHub repository.
The framework estimates and tracks the 6D pose of previously unseen objects, and supports both model-based and model-free settings: at test time, a novel object can be handled either by providing its CAD model or by capturing a small number of reference images, without any object-specific fine-tuning.
A key contribution of FoundationPose is a neural implicit representation that bridges these two settings: it enables effective novel-view synthesis, so the downstream pose estimation modules remain identical whether the object is specified by a CAD model or by reference images.
Key traits of FoundationPose:
- Unified model-based and model-free setup: Works with either CAD models or a few reference images of a novel object.
- No object-specific fine-tuning: Can be applied directly at test time to unseen objects.
- Neural implicit representation: Bridges model-based and model-free pose estimation through view synthesis.
- Transformer-based architecture: Uses a modern architecture designed for strong generalization across diverse objects and scenarios.
- Large-scale synthetic training: Learns robust pose estimation and tracking behavior from extensive synthetic data.

Figure 1 (from the paper) illustrates the workflow of FoundationPose:
- A novel object is provided either as a CAD model or through a small set of reference images.
- A neural implicit object representation is used to synthesize views and unify the object representation across the model-based and model-free cases.
- A pose estimation module predicts the object’s 6D pose from image observations.
- After the first-frame pose is established, the system can switch into tracking mode for subsequent frames.
- The model is trained with large-scale synthetic data, contrastive learning, and architectural components designed for strong zero-shot generalization to unseen objects.
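The estimate-then-track workflow above can be sketched as a simple loop: run global pose estimation on the first frame, then switch to per-frame tracking. The snippet below is a self-contained illustration in NumPy; the class and method names (`register`, `track_one`) are assumptions modeled loosely on the official repository's demo code, and the stub bodies stand in for the learned networks.

```python
import numpy as np

class PoseEstimatorStub:
    """Stand-in for a FoundationPose-style estimator.

    Hypothetical API for illustration only; consult the official
    GitHub repository for the real interface.
    """

    def __init__(self):
        self.pose = None  # 4x4 rigid transform, object frame -> camera frame

    def register(self, rgb, depth):
        # First frame: global pose estimation (here just the identity).
        self.pose = np.eye(4)
        return self.pose

    def track_one(self, rgb, depth):
        # Subsequent frames: refine the previous pose. A small fixed
        # translation stands in for the learned refinement step.
        delta = np.eye(4)
        delta[2, 3] = 0.01  # pretend the object moved 1 cm along z
        self.pose = delta @ self.pose
        return self.pose

# Estimate-then-track loop over a short synthetic "video".
est = PoseEstimatorStub()
frames = [(np.zeros((480, 640, 3)), np.ones((480, 640))) for _ in range(3)]
poses = []
for i, (rgb, depth) in enumerate(frames):
    if i == 0:
        poses.append(est.register(rgb, depth))   # first-frame estimation
    else:
        poses.append(est.track_one(rgb, depth))  # tracking mode
```

The key structural point is the single switch after frame 0: tracking reuses the previous pose as initialization, which is what makes per-frame updates cheap compared to re-running global estimation.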
FoundationPose is intended for:
- 6D object pose estimation of previously unseen objects in RGB-D or vision-based pipelines.
- Pose tracking of novel objects in videos after initialization.
- Robotics applications, such as manipulation and object interaction in dynamic scenes.
- Augmented reality applications, where accurate object alignment and tracking are required.
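In all of these applications, the "6D pose" being estimated is a rigid transform with three rotational and three translational degrees of freedom. A minimal NumPy illustration (not tied to the FoundationPose code) of representing such a pose as a 4x4 matrix and applying it to an object-frame point:

```python
import numpy as np

# A 6D pose = 3D rotation + 3D translation, stored as a 4x4 rigid transform.
theta = np.pi / 2  # 90 degree rotation about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, 0.0, 0.5])  # translation in metres

T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = t

# Transform an object-frame point into the camera frame.
p_obj = np.array([1.0, 0.0, 0.0, 1.0])  # homogeneous coordinates
p_cam = T @ p_obj
```

Pose tracking then amounts to updating `T` frame by frame so that the transformed object model stays aligned with the observed RGB-D data.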
Limitations:
- Practical use may require either a CAD model of the object or a small number of reference images, depending on the setup.
- The method relies heavily on synthetic training data, so real-world performance may vary depending on the domain gap.
- Some training assets and pretrained variants related to diffusion-based texture augmentation are not released due to legal restrictions.
- The released code and data are distributed under the NVIDIA Source Code License, which may restrict certain downstream uses.
¶ BibTeX entry and citation info
@InProceedings{foundationposewen2024,
  author    = {Bowen Wen and Wei Yang and Jan Kautz and Stan Birchfield},
  title     = {{FoundationPose}: Unified 6D Pose Estimation and Tracking of Novel Objects},
  booktitle = {CVPR},
  year      = {2024},
}