
AGI reaches 76.26% task success on the OSWorld benchmark, setting a new global record. (Leaderboard)
Our OS Agent has achieved state-of-the-art performance on the OSWorld leaderboard, surpassing every previous system (as of October 23, 2025).
OSWorld is the first benchmark where AI agents use real computers (Ubuntu, Windows, and macOS) to complete 369 real-world tasks across apps like Chrome, LibreOffice, VS Code, and Thunderbird. It measures true computer-use intelligence through execution-based evaluation of open-ended workflows.
OS Agent is an end-to-end trained computer-use policy that operates directly on real VMs (Linux/Windows/macOS) through a low-level action space: precise clicks, multi-clicks, typing, hotkeys, and more. It also supports optional workflow-level macros.
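As a rough illustration, a low-level action space of this kind can be modeled as a small set of typed actions. The names and fields below are assumptions made for the sketch, not OS Agent's actual interface.

```python
# Hypothetical low-level action space: clicks, typing, hotkeys.
from dataclasses import dataclass
from typing import Literal, Sequence, Union

@dataclass
class Click:
    x: int                      # screen coordinates in pixels
    y: int
    button: Literal["left", "right", "middle"] = "left"
    count: int = 1              # 2 for a double-click, 3 for a triple-click

@dataclass
class TypeText:
    text: str                   # literal characters to type

@dataclass
class Hotkey:
    keys: Sequence[str]         # e.g. ("ctrl", "s")

Action = Union[Click, TypeText, Hotkey]

# Example plan: click into a text field, type, then save with Ctrl+S.
plan: list[Action] = [
    Click(x=640, y=400),
    TypeText(text="hello world"),
    Hotkey(keys=("ctrl", "s")),
]
```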
It achieves 76.26% (274.52/360) success on OSWorld, i.e., superhuman performance relative to the ~72% human baseline. The agent was trained to continuously self-check its actions, exploiting the verification-generation gap (verifying an outcome is easier than producing it): it checks results in real time and corrects course on the next turn when a step fails.
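A minimal sketch of that verify-then-correct loop, assuming hypothetical `propose_action`, `execute`, `verify`, and `is_done` callables; it illustrates only the control flow, not our actual agent code.

```python
from typing import Any, Callable

def run_episode(
    task: Any,
    propose_action: Callable,   # policy: (task, history) -> action
    execute: Callable,          # environment: action -> observation
    verify: Callable,           # self-check: (task, action, observation) -> bool
    is_done: Callable,          # termination check: (task, observation) -> bool
    max_steps: int = 50,
) -> bool:
    """Propose, act, self-verify. Failed steps stay in the history,
    so the policy sees them and can correct course on its next turn."""
    history: list = []
    for _ in range(max_steps):
        action = propose_action(task, history)
        observation = execute(action)
        ok = verify(task, action, observation)
        history.append((action, observation, ok))
        if ok and is_done(task, observation):
            return True
    return False
```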
Training starts from a general-reasoning base model and scales up with hundreds of thousands of synthetic and OSWorld tasks, plus our internal REAL browser environments. We run large-scale rollouts on OSWorld VMs, and AutoEval executes task-specific verifiers to produce reliable rewards.
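For intuition, a task-specific verifier of the kind AutoEval might execute could look like the following. The example task (a CSV sorted by a column) and every name in it are illustrative assumptions.

```python
import csv
from pathlib import Path

def verify_sorted_csv(vm_home: str, filename: str = "report.csv",
                      sort_column: str = "revenue") -> float:
    """Execution-based check: return 1.0 only if the expected file exists
    and its rows are sorted by `sort_column`; otherwise 0.0."""
    path = Path(vm_home) / filename
    if not path.exists():
        return 0.0
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows or sort_column not in rows[0]:
        return 0.0
    try:
        values = [float(r[sort_column]) for r in rows]
    except (TypeError, ValueError):
        return 0.0
    return 1.0 if values == sorted(values) else 0.0
```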
This feeds an online RL loop: run the agent → score via verifiers → mine successes and failures → update the policy → repeat. The combination of execution-based rewards, self-verification, and fine-grained control is what drives the jump in accuracy across diverse computer tasks.
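Schematically, the loop can be written as below; `sample_task`, `run_episode`, `verifier_reward`, and `update_policy` are placeholders standing in for the components described above, not our training stack.

```python
from typing import Any, Callable

def online_rl_loop(
    policy: Any,
    sample_task: Callable,      # draws a synthetic or OSWorld task
    run_episode: Callable,      # (policy, task) -> trajectory
    verifier_reward: Callable,  # (task, trajectory) -> float in [0, 1]
    update_policy: Callable,    # (policy, batch) -> policy
    iterations: int = 1000,
    batch_size: int = 64,
):
    for _ in range(iterations):
        batch = []
        for _ in range(batch_size):
            task = sample_task()
            trajectory = run_episode(policy, task)       # large-scale rollout
            reward = verifier_reward(task, trajectory)   # execution-based score
            batch.append((trajectory, reward))           # keep successes and failures
        policy = update_policy(policy, batch)            # policy update step
    return policy
```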
If you’re a researcher or engineer who wants to push embodied intelligence forward, join our team.
If you’re a developer or company that wants to integrate computer control into your own applications, try our API.
We’re building the future of everyday AGI: agents that can use your phone, your laptop, and your browser as fluidly as you do.