Calibrated environments for frontier models.

Standard Instrument builds environments — a task dataset, a model harness, and a scoring rubric — for evaluating and training language models, world models, and body models.

Submit your model

Send rollouts or grant API access. Receive a calibrated scorecard and failure dossier — or plug the rubric straight into an RL loop.

72h Turnaround
IV Rubric axes
2,184 Probes / model
σ < 0.01 Reproducibility
§ 0 Scope

One instrument, three model classes.

The environment abstraction is model-agnostic. We start where measurement matters most today, and expand into embodiment as body models mature.

Now

Language models

Reasoning, tool-use, agentic harnesses, and long-context faithfulness — scored with programmatic rubrics rather than vibes.

Now

World models

Physical consistency, action controllability, long-horizon drift, causal response, and downstream task usefulness — measured against calibrated baselines.

Next

Body models

Embodied policies for manipulation, locomotion, and whole-body control — evaluated in the same environments they can be trained in.

§ I Anatomy of an Environment

What each environment contains.

An environment is everything required to run and score a model on a task: inputs, harness, and rubric — packaged so the same artefact evaluates a model today and trains it tomorrow.

  1. I.

    Task Dataset

    Curated inputs, held-out splits

    Every environment ships with a versioned corpus of task inputs — held-out, adversarial, and long-tail — sourced from published literature and our own red-teams.

    unit: n tasks · vN

  2. II.

    Model Harness

    Tools, sandboxes, context

    A reproducible runtime for the model under test: tool interfaces, isolated sandboxes, memory & context management, and deterministic seeding. Bring rollouts or an API endpoint.

    unit: reproducible · sealed

  3. III.

    Reward Rubric

    Calibrated, auditable scoring

    Programmatic rubrics grade correctness, physical consistency, causal response, controllability, and long-horizon drift. Every score is auditable back to the frame or token.

    unit: score ∈ [0, 1]

  4. IV.

    Signed Report

    Failure dossier + scorecard

    A bound scorecard with per-task attribution, failure traces, and comparisons against published baselines. Reproducible from the released seed.

    unit: PDF · JSON · seed
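As a rough illustration of how these parts fit together, the Python sketch below models an environment as a versioned task set, a harness, and a rubric, and produces a minimal scorecard. Every name in it (`Task`, `Environment`, `plain_harness`, `exact_match`, and the toy tasks) is hypothetical and chosen for illustration only; it is not Standard Instrument's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Model = Callable[[str], str]  # assumption: a model maps a prompt to a text output


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class Environment:
    name: str
    version: str                              # versioned task corpus, e.g. "v0"
    tasks: Sequence[Task]                     # I.   task dataset
    harness: Callable[[Model, Task], str]     # II.  runs the model on one task
    rubric: Callable[[Task, str], float]      # III. returns a score in [0, 1]

    def evaluate(self, model: Model) -> Dict[str, object]:
        """IV. Build a minimal scorecard with per-task attribution."""
        per_task: Dict[str, float] = {}
        for task in self.tasks:
            output = self.harness(model, task)
            score = self.rubric(task, output)
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"rubric score out of range for {task.task_id}")
            per_task[task.task_id] = score
        mean = sum(per_task.values()) / len(per_task) if per_task else 0.0
        return {
            "environment": self.name,
            "version": self.version,
            "per_task": per_task,
            "mean_score": mean,
        }


def plain_harness(model: Model, task: Task) -> str:
    # Hypothetical harness: no tools or sandbox, just a direct call.
    return model(task.prompt)


def exact_match(task: Task, output: str) -> float:
    # Hypothetical programmatic rubric: 1.0 on exact match, else 0.0.
    return 1.0 if output.strip() == task.expected else 0.0


toy_env = Environment(
    name="toy-arithmetic",
    version="v0",
    tasks=(Task("t1", "2+2", "4"), Task("t2", "3*3", "9")),
    harness=plain_harness,
    rubric=exact_match,
)


def toy_model(prompt: str) -> str:
    return "4"  # a deliberately weak model that always answers "4"


scorecard = toy_env.evaluate(toy_model)
print(scorecard)
```

Because the rubric is an ordinary function of the task and the output, the same `rubric` callable could in principle be reused as a reward signal, which is the property the packaging is meant to preserve.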

§ II Protocol

From submission to bound report.

Three stages, fully instrumented. Every result is timestamped, hashed, and reproducible from the published seed.

Stage 01

Select Environment

Pick from the catalogue — LLM reasoning & tool-use, world-model physics & controllability, or embodied body-model tasks — or commission a bespoke environment.

Stage 02

Submit Model

Send rollouts as tensors or video, or grant authenticated API access. We execute the harness with fixed seeds and record every interaction.

Stage 03

Report or Train

Receive a signed evaluation scorecard and raw traces, or feed the same rubric back into a training loop to improve the model against the signal.
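One way to picture "timestamped, hashed, and reproducible from the published seed" is the sketch below. It is an assumption-laden illustration, not the actual pipeline: `run_with_seed` stands in for a real harness run, `seal_record` is a hypothetical helper, and SHA-256 over canonical JSON is simply one common choice of content hash.

```python
import hashlib
import json
import random
from datetime import datetime, timezone
from typing import Dict, List


def run_with_seed(seed: int, n_steps: int = 5) -> List[float]:
    """Stand-in for a harness run: a fixed seed yields the same trace every time."""
    rng = random.Random(seed)  # local RNG, so global state is untouched
    return [round(rng.random(), 6) for _ in range(n_steps)]


def seal_record(env_name: str, seed: int, trace: List[float]) -> Dict[str, object]:
    """Hash the reproducible part of a result and attach a timestamp.

    The timestamp is kept outside the hashed payload so that a re-run from
    the same seed reproduces the same digest.
    """
    payload = {"environment": env_name, "seed": seed, "trace": trace}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        **payload,
        "sha256": digest,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


first = seal_record("toy-env", 1234, run_with_seed(1234))
second = seal_record("toy-env", 1234, run_with_seed(1234))
other = seal_record("toy-env", 1235, run_with_seed(1235))

print(first["sha256"] == second["sha256"])  # same seed, same digest
print(first["sha256"] == other["sha256"])   # different seed, different digest
```

Anyone holding the seed and the environment version can re-run the harness and compare digests, which is what makes a published result checkable rather than merely asserted.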

The Standard Instrument report.

A bound, signed document. Plate I below is a sample from a recent evaluation. Real reports include per-task attribution, rubric traces, and raw measurement tables.

Plate I — sample failure report including oscilloscope traces, polar drift plot, calibration grid, and causal attribution table
Plate I · Sample Report Excerpt · fig. 1–15

Every evaluation produces a calibrated scorecard, a task-indexed failure dossier, and a rubric trace — the same signal you can then feed straight back into training.
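To make the "feed it back into training" idea concrete, here is a deliberately tiny sketch in which a rubric score is used directly as the reward. It uses simple random-search hill climbing as a stand-in for a real RL algorithm, and everything in it (the one-parameter "policy", the `rubric` function, its `target` value) is a hypothetical toy rather than anything from an actual Standard Instrument environment.

```python
import random
from typing import List, Tuple


def rubric(output: float, target: float = 0.7) -> float:
    """Hypothetical rubric: 1.0 at the target, falling linearly, clipped to [0, 1]."""
    return max(0.0, 1.0 - abs(output - target))


def train(steps: int = 200, seed: int = 0) -> Tuple[float, List[float]]:
    """Random-search hill climbing on a single parameter.

    Each step proposes a perturbed parameter, scores it with the rubric,
    and keeps the proposal only if the reward improves.
    """
    rng = random.Random(seed)      # fixed seed keeps the run reproducible
    param = 0.0                    # the toy "policy" is just this number
    best = rubric(param)
    history = [best]
    for _ in range(steps):
        candidate = param + rng.gauss(0.0, 0.1)
        reward = rubric(candidate)  # the evaluation signal doubles as the reward
        if reward > best:
            param, best = candidate, reward
        history.append(best)
    return param, history


final_param, history = train()
print(f"start reward = {history[0]:.3f}, final reward = {history[-1]:.3f}")
```

The point of the sketch is only the wiring: the function that grades an evaluation is the same function the optimiser maximises, so improvements in training are measured on the scale the scorecard reports.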

Rubric v0.9

Physical consistency: 0.02 % violations
Long-horizon drift @ 500: σ = 0.014
Causal response: 91.8 % pass
Controllability: Fails on contact
Task usefulness · pick-place: 0.74 @ k = 8
Request sample report