A Significant Increase in Digital Labor Automation

The newest frontier models automate substantially more real freelance work than their predecessors.

RLI was jointly developed by the Center for AI Safety (CAIS) and Scale Labs.

The Remote Labor Index (RLI) measures how often AI agents can complete real, economically valuable freelance projects (3D & CAD, architecture, graphic design, video and animation, audio, data analysis, web apps, and more) at a quality a paying client would actually accept. Every deliverable is judged by human evaluators against a gold-standard deliverable produced by a paid professional. The headline metric, the automation rate, is the share of projects where the AI's work is judged as good as (or better than) the human's.

At the benchmark's release, the best AI agent automated just 2.5% of projects. Today we're publishing results for three newer models, paired with stronger agent scaffolding. The automation rate is rising rapidly.

1. New Automation Rates

Fable 5 reaches the highest automation rate measured so far, 16.1%, roughly double Opus 4.8 at 8.3%. GPT‑5.5 reaches 6.3%. All three score above every previously evaluated model.

GPT‑5.5: 6.3%
Opus 4.8: 8.3%
Fable 5: 16.1%* (highest)

For context, the previous published leader sat at 4.17% (Opus 4.6 with the Claude Cowork scaffold), and the field topped out at 2.5% when RLI was released. The frontier has more than quadrupled in under eight months, a concrete signal of how quickly economically capable AI agents are advancing.

Automation Rate (%) by model [bar chart]
Gemini 3 Pro: 1.25%
Grok 4: 2.08%
GPT-5.2: 2.50%
Manus 1.6 Max: 2.92%
Opus 4.6: 4.17%
GPT-5.5: 6.3%
Opus 4.8: 8.3%
Fable 5: 16.1%
Full-automation rate on the Remote Labor Index: the share of projects where each model's deliverable was judged at least as good as the professional's. The three newly evaluated models (Fable 5, Opus 4.8, GPT‑5.5) score above every previously evaluated model.


* We were able to evaluate 218 of RLI's 240 projects before access to Fable 5 was restricted by the U.S. government. We will update the results on the remaining projects shortly. The 22 unevaluated projects are spread uniformly across the benchmark, not concentrated in any sector or difficulty band. Even under the worst-case assumption that Fable 5 failed every missing project, its automation rate would still be 14.6%, higher than any other model.
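The worst-case figure in the footnote can be checked with a few lines of arithmetic. The sketch below assumes the number of successful Fable 5 projects is the integer nearest to 16.1% of 218 (the raw count is not stated in the text), and the variable names are illustrative.

```python
TOTAL_PROJECTS = 240    # size of the RLI benchmark, from the footnote
EVALUATED = 218         # Fable 5 projects scored before access was restricted
REPORTED_RATE = 0.161   # automation rate on the evaluated subset

# Assumption: the pass count is the integer nearest to rate * evaluated.
passes = round(REPORTED_RATE * EVALUATED)

# Worst case: every one of the unevaluated projects counts as a failure,
# so the pass count stays fixed while the denominator grows to 240.
worst_case = 100 * passes / TOTAL_PROJECTS

print(f"passes on evaluated subset: {passes}")
print(f"worst-case automation rate: {worst_case:.1f}%")  # 14.6%
```

Under that assumption the lower bound stays well above the 8.3% measured for Opus 4.8, which is the comparison the footnote relies on.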

2. What the Work Looks Like

Automation rate is a single number, but RLI projects are concrete pieces of commissioned work, each with a client brief, input files, and a professional deliverable.

Ring Design

3D & CAD

The brief

Re-create the client's existing engagement ring with its emerald-cut center stone swapped for a marquise cut, delivering an updated 3D model plus photorealistic rose- and yellow-gold renders.

Input Files

Reference photo and specification table for the original emerald-cut ring
ringdetails.png

Photorealistic render

[Images: GPT‑5.5 render · Opus 4.8 render · Fable 5 render · Human render of rose-gold marquise ring]

Underlying 3D model

[Images: GPT‑5.5 CAD model · Opus 4.8 CAD model · Fable 5 CAD model · Human CAD model]

Fable 5's ring design is qualitatively much better than deliverables from previous AIs, but closer examination shows it still falls short of professional quality (the rounded prong design is low-effort).

Advertisement Video

Video & Animation

The brief

Produce a ~60-second flat-design 2D animated advertisement for "Skyline Tree Services," set to the provided voiceover, that walks viewers through the company's tree-care process and builds trust in the brand.

Input Files

VoiceOver.wav
Raw voiceover audio (the only input file)

Deliverable: three frames per video (intro · consultation · safety)

[Images: Human animation frames · Fable 5 animation frames · Opus 4.8 animation frames · GPT‑5.5 animation frames]

On 2D vector-graphics animation, visual quality improves noticeably with the newer models: both the animation itself and its synchronization with the audio become smoother.

Floor Plan & Renders

Architecture

The brief

From a scanned cadastral plan, site photos, and measurements, produce a clean dimensioned floor plan, furniture-layout options, and photorealistic renders of the redesigned bathroom.

Input Files

Scanned cadastral floor plan provided by the client
cadastral floor plan.jpg  ·  +11 more

Dimensioned floor plan

[Images: GPT‑5.5 floor plan · Opus 4.8 floor plan · Fable 5 floor plan · Human floor plan]

Bathroom render

[Images: GPT‑5.5's actual 3D model (untextured, malformed massing) beside its image-generated stand-in "render" · Opus 4.8 bathroom render · Fable 5 bathroom render · Human bathroom render]

The floor plans get visibly more accurate and the 3D models more detailed with the newer models, Fable 5 strongest among them. Note that GPT‑5.5's good-looking render is faked with an image generator. Its 3D model deliverable is shown beside it.

3. Human Judges Remain Necessary

As models improve and the benchmark grows more expensive to score, it's tempting to replace human evaluators with an automated "LLM judge." We built one: a frontier agent that opens both deliverables in real applications, inspects them the way a client would, and decides whether the AI's work would be accepted. We calibrated it on earlier models (Opus 4.5, GPT‑5.2, Opus 4.6, and Manus 1.6 Max) by tuning one global threshold to their combined 3.3% human rate.

When we applied that same calibrated judge to the two newest models, which it had never seen, it badly overshot:

Model                              Human (ground truth)   Automated judge   Overestimate
GPT‑5.5                            6.25%                  17.9%             ≈ 2.9×
Opus 4.8                           8.33%                  18.8%             ≈ 2.3×
Earlier models (calibration set)   3.3%                   ~3%               ≈ 1×

The inflation shows up specifically on the newest models and holds across the range of reasonable calibration settings: roughly 3× for GPT‑5.5 and ~2.5× for Opus 4.8. Importantly, the judge still ranked the models correctly (Spearman ρ = 0.90), placing both frontier models well above the rest, though it badly overstated how far above. That makes an automated grader useful for tracking relative progress, but not a substitute for human evaluation of absolute capability.
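The overestimate column in the table is simply the automated judge's rate divided by the human rate. A minimal sketch using only the two rows reported above (the dictionary layout and function name are illustrative):

```python
# (human automation rate %, automated-judge automation rate %), from the table
results = {
    "GPT-5.5": (6.25, 17.9),
    "Opus 4.8": (8.33, 18.8),
}


def overestimate(human_pct: float, judge_pct: float) -> float:
    """Factor by which the automated judge inflates the human-measured rate."""
    return judge_pct / human_pct


ratios = {
    model: round(overestimate(human, judge), 1)
    for model, (human, judge) in results.items()
}
print(ratios)  # {'GPT-5.5': 2.9, 'Opus 4.8': 2.3}
```

A ratio near 1 means the judge agrees with human evaluators on the absolute rate; a correct ranking alone does not imply that.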

Why the Automated Judge Fails

The deeper reason is that evaluating an RLI deliverable is itself a demanding, agentic task. Doing it properly means opening the project's files in the right professional applications, operating those applications competently, and forming a judgment the way a client would: exactly the computer-use skills at which today's agents are still weakest. So an AI judge inherits the same limitations as the AI workers it grades. The GPT‑5.5 render above is a clean illustration: catching the fake requires opening the 3D project and inspecting the actual geometry, which a judge that can't reliably operate the software simply won't do.

This matches what others have found: OpenAI's GDPval reports that their automated grader agrees with human experts less often than humans agree with each other (a ~5-point gap), and is "not a full substitute for industry expert graders."

4. Better Elicitation

As publicly available agent scaffolds have matured, we have standardized on running each model in the strongest industry scaffold for its family, with only minor modifications. This keeps our setup close to how these models are actually deployed on real work, and it lets us focus on giving each model the tools and resources needed to measure its capability accurately rather than building bespoke harnesses.

Industry Scaffolds with Computer Use

We run Anthropic models in Claude Code and OpenAI models in Codex CLI, the same coding agents developers use day to day, modified only to add a native computer-use tool (screenshot, then click or type, then screenshot) so an agent can drive graphical applications and not just the command line. Each model works inside a full Linux desktop VM stocked with over 30 professional applications spanning every domain RLI covers: Blender, FreeCAD and OpenSCAD for 3D and CAD; GIMP, Inkscape and Scribus for design; Kdenlive and ffmpeg for video; Audacity, LMMS and MuseScore for audio; the full LibreOffice suite and LaTeX for documents; and more.
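The screenshot, act, screenshot cycle described above can be sketched as a small observe-act loop. This is an illustration only: `Desktop`, `Action`, `run_computer_use`, and the `policy` callable are hypothetical names, not the interface of Claude Code, Codex CLI, or the tool actually added to them.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Action:
    kind: str        # "click", "type", or "done" (hypothetical action set)
    x: int = 0
    y: int = 0
    text: str = ""


class Desktop(Protocol):
    """Hypothetical handle on the VM's graphical session."""

    def screenshot(self) -> bytes: ...
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...


def run_computer_use(
    desktop: Desktop,
    policy: Callable[[bytes], Action],
    max_steps: int = 50,
) -> int:
    """Screenshot, let the model choose an action, apply it, and repeat.

    Returns the number of actions applied before the policy signalled "done"
    (or max_steps if it never did).
    """
    for step in range(max_steps):
        action = policy(desktop.screenshot())   # observe, then decide
        if action.kind == "done":
            return step
        if action.kind == "click":
            desktop.click(action.x, action.y)
        elif action.kind == "type":
            desktop.type_text(action.text)
        else:
            raise ValueError(f"unknown action kind: {action.kind!r}")
    return max_steps
```

In this sketch `policy` stands in for the model call: it receives the latest screenshot and returns the next action, so the model always sees the effect of its previous step before choosing the next one.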

Increased Resource Allocation

We also take care that a weak setup does not understate what the models can do. Each project gets up to 24 hours of wall-clock time, one NVIDIA A100 GPU when a task needs it (for rendering, encoding, or simulation), a generous per-project budget, and each model's extended, high-reasoning-effort setting. The aim is to measure what a model can realistically accomplish on a genuine commission.

A Worker-Critic Loop

One addition we found especially useful for RLI is a worker-critic loop. Worker agents tend to be overly optimistic about their own output and rarely apply a critical eye to it, so an independent critic agent reviews each deliverable the way a demanding client would (opening files, taking screenshots, and checking the work against the brief), and the worker revises until the critic is satisfied or the budget is reached. In practice this improves results and lets additional budget translate into better deliverables.

We plan to study the effect of scaling cost more carefully in future work. For now we fix a per-project budget chosen so the cap is not reached on too many tasks: $50 by default, and $150 for Fable 5, whose higher per-token pricing requires a larger dollar budget for most projects to finish without hitting the cap.
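The worker-critic loop with a per-project dollar cap can be sketched as follows. This is a simplified illustration under stated assumptions, not the harness used for RLI: `work`, `critique`, and `Review` are hypothetical, each call is assumed to report its own cost, and the cap is checked only between calls, so the final call may overshoot it slightly.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    accepted: bool
    feedback: str


def worker_critic_loop(
    work: Callable[[Optional[str]], Tuple[str, float]],
    critique: Callable[[str], Tuple[Review, float]],
    budget_usd: float = 50.0,   # default cap from the text; $150 for Fable 5
) -> Tuple[Optional[str], float]:
    """Alternate worker and critic until the critic accepts or the budget is spent.

    work(feedback) returns (deliverable, cost_usd); feedback is None on the first pass.
    critique(deliverable) returns (Review, cost_usd).
    """
    spent = 0.0
    deliverable: Optional[str] = None
    feedback: Optional[str] = None
    while spent < budget_usd:
        deliverable, cost = work(feedback)      # worker produces or revises
        spent += cost
        if spent >= budget_usd:
            break                               # no budget left for a review
        review, cost = critique(deliverable)    # independent critic inspects it
        spent += cost
        if review.accepted:
            break
        feedback = review.feedback              # worker revises on the next pass
    return deliverable, spent
```

Keeping the critic as a separate callable mirrors the point made above: the review comes from an agent that did not produce the work, so it is less likely to share the worker's optimism.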

5. What About Time Horizons?

RLI metadata includes human completion times for the gold-standard deliverables. These times come from the professionals who actually did the work and follow a clean log-normal distribution (Figure 4 of the RLI paper). Now that models are beginning to score above the floor on RLI, a natural question is whether we can apply time-horizon analysis, which summarizes a model's capability as the length of human task it can reliably complete.

Each line is a model's pass rate (win-or-tie against the human deliverable), from human evaluation, across RLI projects grouped by how long the work took a human professional. The relationship is flat to slightly rising, not the downward slope a time-horizon model would assume.

On RLI, the answer is no. Time-horizon analysis assumes that work taking humans longer is harder for AIs. While accurate in specific domains like coding, this assumption does not hold on the diverse distribution of remote work that RLI represents. A model's success rate does not fall as human completion time rises; many other factors determine whether it succeeds, and the time-horizon model fits the data poorly. This matches the jagged-frontier picture of AI capability. Some work that is quick for a skilled professional stays out of reach, such as transcribing music or playtesting a real-time game, while other work that would take a person hours, such as digital art or coding, is finished by current models in minutes. This may change as models improve. For now, we will continue to report the automation rate as our primary metric.
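For reference, time-horizon analysis is commonly formalized as a logistic fit of success probability against the logarithm of human completion time. The form below is a sketch under that assumption; the symbols t, h, and β are introduced here for illustration and are not taken from the RLI paper.

```latex
P(\text{success} \mid t) \;=\; \sigma\bigl(\beta\,(\log_2 h - \log_2 t)\bigr),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Here t is the human completion time of a project, h is the model's 50% time horizon (the value of t at which the predicted success probability equals 1/2), and β > 0 sets how quickly success falls off as tasks get longer. A flat or slightly rising pass rate across time buckets corresponds to β ≤ 0, in which case h no longer summarizes capability in any useful way.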

Outlook

Since RLI was released, the best automation rate has risen from 2.5% to 16.1%. Today's AIs still fall short of professional quality on most projects; none of the three Fable 5 deliverables above would be accepted as finished work. However, this increase in automation rate has been rapid, occurring in less than a year. RLI spans a wide range of economically valuable remote work, so this trend directly captures how quickly the automation of remote work is advancing. We will continue adding the results of major model releases to RLI, giving a current view of AI's ability to automate remote work and enabling policymakers and the public to proactively navigate AI-driven labor automation.

Our partners at Scale Labs run the manual evaluations for RLI. To have a model evaluated, contact udari.sehwag@scale.com.

Footnotes