gpt-oss is an open-weight reasoning model family released by OpenAI, with two main variants: gpt-oss-120b and gpt-oss-20b. It was introduced in OpenAI's release announcement, "Introducing gpt-oss", and is further documented in the gpt-oss model card.
Code, reference implementations, setup guides, and usage examples are publicly available in the official GitHub repository.
gpt-oss is a family of open-weight language models designed for strong reasoning, tool use, and flexible deployment. The two released variants target different operating points: gpt-oss-120b is aimed at higher-end reasoning workloads that fit on a single 80 GB GPU, while gpt-oss-20b is optimized for lower-latency and lighter local use cases, including deployments that can run within roughly 16 GB of memory. Both models are released under the Apache 2.0 license.
Architecturally, both models are mixture-of-experts (MoE) Transformers. The 120b variant has 117B total parameters with 5.1B active parameters per token, while the 20b variant has 21B total parameters with 3.6B active parameters per token. Both support a 128k-token context length and use alternating dense and sparse attention patterns, grouped multi-query attention, and RoPE (rotary) positional embeddings.
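To make the efficiency implication of these parameter counts concrete, a quick back-of-the-envelope calculation shows how small the active fraction is per token (figures taken from the counts above):

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an MoE model's parameters activated for each token."""
    return active_params_b / total_params_b

# Parameter counts (in billions) from the gpt-oss release.
print(f"gpt-oss-120b: {active_fraction(117, 5.1):.1%} active per token")  # 4.4%
print(f"gpt-oss-20b:  {active_fraction(21, 3.6):.1%} active per token")   # 17.1%
```

So the 120b variant touches only about one in twenty-three of its parameters for each generated token, which is what lets it fit reasoning workloads on a single 80 GB GPU.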
Key traits of gpt-oss:
- Open-weight reasoning models: Designed for strong performance on reasoning-heavy tasks.
- Mixture-of-experts architecture: Improves efficiency by activating only part of the model per token.
- Tool-use support: Built for workflows involving browsing, Python, and structured outputs.
- Configurable reasoning effort: Supports low, medium, and high reasoning settings.
- Flexible deployment: Can be used with Transformers, vLLM, Ollama, LM Studio, and other inference stacks.
- Two released sizes: gpt-oss-120b and gpt-oss-20b, both MoE Transformers with sparse expert activation, covering larger-scale and lighter local deployment scenarios.
- Post-trained for agentic workflows: Instruction following, tool use, and structured outputs.
- Harmony response format: Released together with the Harmony response format and reference implementations for local inference and tool integration.
- Broad ecosystem support: Intended to run across a wide range of deployment platforms and hardware targets.
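Reasoning effort is selected through the system message rather than a dedicated API flag. As a minimal sketch, a helper might validate the level and emit the corresponding system-message line; the exact `Reasoning: <level>` wording mirrors the Harmony convention, but treat it as illustrative and consult the official Harmony documentation for the canonical layout:

```python
VALID_EFFORTS = ("low", "medium", "high")

def reasoning_line(effort: str) -> str:
    """Return a system-message line selecting a reasoning-effort level.

    The 'Reasoning: <level>' wording follows the Harmony format's convention;
    check the official harmony docs for the canonical system-message layout.
    """
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return f"Reasoning: {effort}"

# Prepend the line to an ordinary system prompt.
system_prompt = "You are a helpful assistant.\n" + reasoning_line("high")
print(system_prompt)
```

Higher effort trades latency for longer internal chains of thought, so the setting is typically chosen per request rather than fixed at deployment time.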
gpt-oss is intended for:
- Reasoning-heavy language tasks that benefit from open-weight deployment.
- Agentic workflows involving function calling, browsing, Python execution, and structured outputs.
- Local or on-premise inference where developers want more control over deployment and customization.
- Fine-tuning and experimentation in research or enterprise settings.
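An agentic workflow of the kind listed above reduces to a loop: send messages, and when the model replies with a tool call, execute it and feed the result back. The sketch below is model-free and schematic; the tool-call dict shape and the `get_weather` tool are assumptions for illustration, not the actual Harmony wire format:

```python
import json

def get_weather(city: str) -> str:
    """Toy tool used only for illustration."""
    return f"Sunny in {city}"

# Registry mapping tool names to callables.
TOOLS = {"get_weather": get_weather}

def handle_model_turn(turn: dict) -> dict:
    """Dispatch one model turn: run a requested tool, or pass text through."""
    if turn.get("type") == "tool_call":
        fn = TOOLS[turn["name"]]
        result = fn(**json.loads(turn["arguments"]))
        # The tool result is returned as a message to append to the history.
        return {"role": "tool", "name": turn["name"], "content": result}
    return {"role": "assistant", "content": turn.get("content", "")}

reply = handle_model_turn(
    {"type": "tool_call", "name": "get_weather", "arguments": '{"city": "Paris"}'}
)
print(reply["content"])  # Sunny in Paris
```

In a real deployment the loop continues until the model produces a final text answer; the reference implementations in the GitHub repository show the Harmony-specific details.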
Limitations:
- The models are text-only; the release is positioned around language reasoning rather than native multimodal input.
- Correct behavior depends on the Harmony response format, and prompts that deviate from it may degrade performance.
- The released tool implementations are described as reference or educational implementations, not production-ready systems.
- Although the models produce a full chain-of-thought internally, the raw chain-of-thought is not intended to be shown directly to end users.
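The last point has a practical consequence: applications should surface only the final answer and drop the chain-of-thought. Harmony tags assistant output with channels such as `analysis` (reasoning) and `final` (user-facing answer); a minimal filter, with the message shape assumed for illustration, might look like:

```python
def user_visible_text(messages: list[dict]) -> str:
    """Concatenate only 'final'-channel content, hiding chain-of-thought."""
    return "".join(
        m["content"] for m in messages if m.get("channel") == "final"
    )

# Hypothetical parsed output: one analysis message, one final message.
sample = [
    {"channel": "analysis", "content": "Let me reason step by step..."},
    {"channel": "final", "content": "The answer is 42."},
]
print(user_visible_text(sample))  # The answer is 42.
```

The official harmony library provides the actual parsing; this sketch only shows where the filtering belongs in an application.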
BibTeX entry and citation info
@misc{openai2025gptoss120bgptoss20bmodel,
      title={gpt-oss-120b \& gpt-oss-20b Model Card},
      author={OpenAI},
      year={2025},
      eprint={2508.10925},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.10925},
}