
AGI reaches 76.26% task success on the OSWorld benchmark, setting a new global record. (Leaderboard)
Our OS Agent has achieved state-of-the-art performance on the OSWorld leaderboard, surpassing every previous system (as of October 23, 2025).
OSWorld is the first benchmark where AI agents use real computers (Ubuntu, Windows, and macOS) to complete 369 real-world tasks across apps like Chrome, LibreOffice, VS Code, and Thunderbird. It measures true computer-use intelligence through execution-based evaluation of open-ended workflows.
OS Agent is an end-to-end trained computer-use policy that operates directly on real VMs (Linux/Windows/macOS) through a low-level action space: precise clicks, multi-clicks, typing, hotkeys, and more. It also supports optional workflow-level macros.
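As a rough illustration, a low-level action space of this kind can be modeled as a small set of typed actions. The names and fields below are assumptions made for the sketch, not OS Agent's actual interface.

```python
# Hypothetical low-level action space: clicks, typing, hotkeys.
from dataclasses import dataclass
from typing import Literal, Sequence, Union

@dataclass
class Click:
    x: int                      # screen coordinates in pixels
    y: int
    button: Literal["left", "right", "middle"] = "left"
    count: int = 1              # 2 for a double-click, 3 for a triple-click

@dataclass
class TypeText:
    text: str                   # literal characters to type

@dataclass
class Hotkey:
    keys: Sequence[str]         # e.g. ("ctrl", "s")

Action = Union[Click, TypeText, Hotkey]

# Example plan: click into a text field, type, then save with Ctrl+S.
plan: list[Action] = [
    Click(x=640, y=400),
    TypeText(text="hello world"),
    Hotkey(keys=("ctrl", "s")),
]
```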
It achieves 76.26% (274.52/360) success on OSWorld, i.e., superhuman performance relative to the ~72% human baseline. The agent was trained to continuously self-check its actions, exploiting the verification-generation gap (verifying an outcome is easier than producing it): it checks results in real time and corrects course on the next turn when a step fails.
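A minimal sketch of that verify-then-correct loop, assuming hypothetical `propose_action`, `execute`, `verify`, and `is_done` callables; it illustrates only the control flow, not our actual agent code.

```python
from typing import Any, Callable

def run_episode(
    task: Any,
    propose_action: Callable,   # policy: (task, history) -> action
    execute: Callable,          # environment: action -> observation
    verify: Callable,           # self-check: (task, action, observation) -> bool
    is_done: Callable,          # termination check: (task, observation) -> bool
    max_steps: int = 50,
) -> bool:
    """Propose, act, self-verify. Failed steps stay in the history,
    so the policy sees them and can correct course on its next turn."""
    history: list = []
    for _ in range(max_steps):
        action = propose_action(task, history)
        observation = execute(action)
        ok = verify(task, action, observation)
        history.append((action, observation, ok))
        if ok and is_done(task, observation):
            return True
    return False
```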
Training starts from a general-reasoning base model and scales up with hundreds of thousands of synthetic and OSWorld tasks, plus our internal REAL browser environments. We run large-scale rollouts on OSWorld VMs, and AutoEval executes task-specific verifiers to produce reliable rewards.
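For intuition, a task-specific verifier of the kind AutoEval might execute could look like the following. The example task (a CSV sorted by a column) and every name in it are illustrative assumptions.

```python
import csv
from pathlib import Path

def verify_sorted_csv(vm_home: str, filename: str = "report.csv",
                      sort_column: str = "revenue") -> float:
    """Execution-based check: return 1.0 only if the expected file exists
    and its rows are sorted by `sort_column`; otherwise 0.0."""
    path = Path(vm_home) / filename
    if not path.exists():
        return 0.0
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows or sort_column not in rows[0]:
        return 0.0
    try:
        values = [float(r[sort_column]) for r in rows]
    except (TypeError, ValueError):
        return 0.0
    return 1.0 if values == sorted(values) else 0.0
```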
This feeds an online RL loop: run the agent → score via verifiers → mine successes and failures → update the policy → repeat. The combination of execution-based rewards, self-verification, and fine-grained control is what drives the jump in accuracy across diverse computer tasks.
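Schematically, the loop can be written as below; `sample_task`, `run_episode`, `verifier_reward`, and `update_policy` are placeholders standing in for the components described above, not our training stack.

```python
from typing import Any, Callable

def online_rl_loop(
    policy: Any,
    sample_task: Callable,      # draws a synthetic or OSWorld task
    run_episode: Callable,      # (policy, task) -> trajectory
    verifier_reward: Callable,  # (task, trajectory) -> float in [0, 1]
    update_policy: Callable,    # (policy, batch) -> policy
    iterations: int = 1000,
    batch_size: int = 64,
):
    for _ in range(iterations):
        batch = []
        for _ in range(batch_size):
            task = sample_task()
            trajectory = run_episode(policy, task)       # large-scale rollout
            reward = verifier_reward(task, trajectory)   # execution-based score
            batch.append((trajectory, reward))           # keep successes and failures
        policy = update_policy(policy, batch)            # policy update step
    return policy
```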
If you’re a researcher or engineer who wants to push embodied intelligence forward, join our team.
If you’re a developer or company that wants to integrate computer control into your own applications, try our API.
We’re building the future of everyday AGI: agents that can use your phone, your laptop, and your browser as fluidly as you do.