RT-DETR (Real-Time Detection Transformer) is an end-to-end real-time object detector developed by Baidu. It was introduced in the paper DETRs Beat YOLOs on Real-time Object Detection by Zhao, Lv, Xu, Wei, Wang, Dang, Liu, and Chen (2023).
Code, pretrained models, and implementation details are publicly available in the official GitHub repository.
RT-DETR is a real-time, end-to-end object detection model built on the DETR family of Transformer-based detectors. It was designed to address the trade-off between speed and accuracy in real-time detection while preserving the key DETR advantage of an NMS-free design: it requires no non-maximum suppression during post-processing.
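For contrast, the post-processing step that RT-DETR eliminates is classic greedy NMS: score-sorted boxes are kept one at a time while overlapping lower-scored duplicates are discarded. A minimal sketch (illustrative box data, not RT-DETR code):

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b, format (x1, y1, x2, y2)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30.0]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # keeps indices 0 and 2; box 1 overlaps box 0 too much
```

An end-to-end detector such as RT-DETR is trained so that its queries produce de-duplicated predictions directly, making this step (and its tuning threshold) unnecessary.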
The model introduces an efficient hybrid encoder that processes multiscale features by decoupling intra-scale interaction and cross-scale fusion, which significantly reduces computational cost. RT-DETR also uses IoU-aware query selection to provide strong initial object queries for the decoder, improving detection quality without sacrificing real-time performance.
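The query-selection idea above can be sketched as a top-K pick over encoder features: because training couples classification confidence with predicted IoU, the class score doubles as a localization-quality signal. The shapes and names below are illustrative assumptions, not the official implementation:

```python
import numpy as np

# Hedged sketch of IoU-aware query selection: take the K encoder tokens with
# the highest (IoU-aware) classification scores and use their features to
# initialize the decoder's object queries.
rng = np.random.default_rng(0)
num_tokens, dim, num_classes, k = 1000, 256, 80, 300  # toy sizes

encoder_feats = rng.standard_normal((num_tokens, dim))          # encoder output
class_logits = rng.standard_normal((num_tokens, num_classes))   # per-token logits

# Max class probability per token; IoU-aware training makes this reflect
# both classification confidence and expected box quality.
scores = (1.0 / (1.0 + np.exp(-class_logits))).max(axis=1)

top_k_idx = np.argsort(scores)[::-1][:k]     # indices of the K best tokens
initial_queries = encoder_feats[top_k_idx]   # shape (300, 256)
```

The decoder then refines these K queries rather than starting from learned or random embeddings, which is where the improved initialization comes from.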
Key traits of RT-DETR:
- End-to-end detector: Performs object detection without requiring non-maximum suppression.
- Efficient hybrid encoder: Accelerates multiscale feature processing by separating intra-scale and cross-scale operations.
- IoU-aware query selection: Improves the initialization of decoder queries for better object localization.
- Flexible speed tuning: Supports changing the number of decoder layers at inference time without retraining.
- Anchor-free detection: Avoids anchor design and simplifies the detection pipeline.
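The flexible speed tuning trait can be illustrated with a toy decoder: since every decoder layer is trained with an auxiliary prediction head, inference can stop after any prefix of the layers, trading accuracy for speed without retraining. The "layer" below is a stand-in residual refinement, not the real attention block:

```python
import numpy as np

# Hedged sketch: a 6-layer decoder whose output can be read after any prefix.
rng = np.random.default_rng(1)
num_layers, num_queries, dim = 6, 300, 256

queries = rng.standard_normal((num_queries, dim))
layer_weights = [rng.standard_normal((dim, dim)) * 0.01 for _ in range(num_layers)]

def run_decoder(queries, n_layers):
    """Refine object queries through only the first n_layers blocks."""
    x = queries
    for w in layer_weights[:n_layers]:
        x = x + np.tanh(x @ w)  # toy residual refinement step
    return x

fast = run_decoder(queries, 3)  # speed-tuned: half the layers
full = run_decoder(queries, 6)  # full-quality pass
```

In the real model, each truncation point still yields valid boxes because the auxiliary heads supervise every layer during training.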

Figure 1 (from the paper) illustrates the architecture and workflow of RT-DETR:
- The last three backbone stages (S3, S4, and S5) are used as multiscale inputs to the encoder.
- An efficient hybrid encoder transforms these multiscale features into image representations through AIFI (intra-scale feature interaction) and CCFM (cross-scale feature fusion module).
- IoU-aware query selection is used to choose a fixed number of informative image features as the initial object queries for the decoder.
- The decoder, together with auxiliary prediction heads, iteratively refines these queries to generate final bounding boxes and confidence scores.
- Because RT-DETR is a DETR-style detector, the full pipeline remains end-to-end and NMS-free.
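The intra-scale step of the workflow above can be sketched in a few lines: the cost saving of the hybrid encoder comes from applying self-attention only to the highest-level, lowest-resolution feature map (S5) rather than over all scales jointly. Single-head attention here is a deliberate simplification of the real AIFI block:

```python
import numpy as np

# Hedged sketch of AIFI: flatten S5 and run plain self-attention over it.
rng = np.random.default_rng(2)
h, w, dim = 20, 20, 256
s5 = rng.standard_normal((h * w, dim))  # flattened S5 feature map (400 tokens)

def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1
    return attn @ x

s5_out = self_attention(s5)  # shape (400, 256)
```

Attention cost grows quadratically with token count, so restricting it to the coarsest map keeps the encoder fast; the finer S3 and S4 maps are then merged with this output by the convolutional CCFM fusion path.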
RT-DETR is intended for:
- Real-time object detection in applications where both speed and accuracy are critical.
- Autonomous driving and robotics scenarios requiring fast environmental perception.
- Surveillance and monitoring systems with continuous real-time inference needs.
- Industrial and scientific vision tasks where transformer-based global context can improve detection quality.
Limitations:
- RT-DETR is focused on object detection, not segmentation or tracking.
- As a Transformer-based detector, deployment efficiency may depend strongly on the hardware and inference backend.
- While RT-DETR provides flexible inference speed tuning, performance still depends on model scale and available compute resources.
- Newer variants such as RT-DETRv2 further improve the baseline, so the original RT-DETR may not always represent the most up-to-date version for production use.
BibTeX entry and citation info
@misc{lv2023detrs,
  title={DETRs Beat YOLOs on Real-time Object Detection},
  author={Yian Zhao and Wenyu Lv and Shangliang Xu and Jinman Wei and Guanzhong Wang and Qingqing Dang and Yi Liu and Jie Chen},
  year={2023},
  eprint={2304.08069},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}