Language models
Reasoning, tool-use, agentic harnesses, and long-context faithfulness — scored with programmatic rubrics rather than vibes.
Standard Instrument builds environments — a task dataset, a model harness, and a scoring rubric — for evaluating and training language models, world models, and body models.
Submit your model
Send rollouts or grant API access. Receive a calibrated scorecard and failure dossier, or plug the rubric straight into an RL loop.
The environment abstraction is model-agnostic. We start where measurement matters most today, and expand into embodiment as body models mature.
Physical consistency, action controllability, long-horizon drift, causal response, and downstream task usefulness — measured against calibrated baselines.
Embodied policies for manipulation, locomotion, and whole-body control — evaluated in the same environments they can be trained in.
An environment is everything required to run and score a model on a task: inputs, harness, and rubric — packaged so the same artefact evaluates a model today and trains it tomorrow.
Curated inputs, held-out splits
Every environment ships with a versioned corpus of task inputs — held-out, adversarial, and long-tail — sourced from published literature and our own red-teams.
n tasks · vN
Tools, sandboxes, context
A reproducible runtime for the model under test: tool interfaces, isolated sandboxes, memory & context management, and deterministic seeding. Bring rollouts or an API endpoint.
reproducible · sealed
Calibrated, auditable scoring
Programmatic rubrics grade correctness, physical consistency, causal response, controllability, and long-horizon drift. Every score is auditable back to the frame or token.
score ∈ [0, 1]
Failure dossier + scorecard
A bound scorecard with per-task attribution, failure traces, and comparisons against published baselines. Reproducible from the released seed.
PDF · JSON · seed
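The dataset, harness, and rubric described above can be pictured as one small packaged object. The following is only an illustrative sketch: `Task`, `Environment`, `simple_harness`, `exact_match`, and `toy_model` are hypothetical names invented for this example, not Standard Instrument's actual API, and exact-match grading stands in for a real programmatic rubric.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Task:
    """One task input with a reference answer (hypothetical schema)."""
    task_id: str
    prompt: str
    expected: str


Model = Callable[[str], str]


@dataclass(frozen=True)
class Environment:
    """Bundles task inputs, a harness, and a rubric under a fixed seed."""
    tasks: Sequence[Task]
    harness: Callable[[Model, Task, random.Random], str]
    rubric: Callable[[Task, str], float]
    seed: int = 0

    def evaluate(self, model: Model) -> Dict[str, float]:
        # A seeded RNG keeps any harness randomness reproducible.
        rng = random.Random(self.seed)
        scores = {}
        for task in self.tasks:
            output = self.harness(model, task, rng)
            raw = self.rubric(task, output)
            # Clamp so every score stays in [0, 1].
            scores[task.task_id] = min(1.0, max(0.0, raw))
        return scores


def simple_harness(model: Model, task: Task, rng: random.Random) -> str:
    # Assumed minimal harness: one model call, no tools or sandbox.
    return model(task.prompt)


def exact_match(task: Task, output: str) -> float:
    # Assumed toy rubric: full credit only for an exact answer.
    return 1.0 if output.strip() == task.expected else 0.0


def toy_model(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "unknown")


env = Environment(
    tasks=[Task("t1", "2+2", "4"), Task("t2", "3+3", "6")],
    harness=simple_harness,
    rubric=exact_match,
    seed=7,
)
scores = env.evaluate(toy_model)
print(scores)
```

Because the rubric is an ordinary function of the task and the output, the same object can report per-task scores for evaluation or hand those scores to a training loop as a reward signal.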
Three stages, fully instrumented. Every result is timestamped, hashed, and reproducible from the published seed.
Pick from the catalogue — LLM reasoning & tool-use, world-model physics & controllability, or embodied body-model tasks — or commission a bespoke environment.
Send rollouts as tensors or video, or grant authenticated API access. We execute the harness with fixed seeds and record every interaction.
Receive a signed evaluation scorecard and raw traces, or feed the same rubric back into a training loop to improve the model against the signal.
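As a rough sketch of what "timestamped, hashed, and reproducible" could look like in practice, the example below serialises a result record to canonical JSON and attaches a SHA-256 digest. The record fields and the function names `seal_result` and `verify_result` are assumptions made for illustration. A plain hash only detects tampering; a signed scorecard would additionally need a key-based signature, which is not shown here.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def _canonical_bytes(record: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte encoding.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_result(
    environment: str,
    seed: int,
    scores: Dict[str, float],
    timestamp: Optional[str] = None,
) -> dict:
    """Build a result record and attach its SHA-256 digest."""
    record = {
        "environment": environment,
        "seed": seed,
        "scores": scores,
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
    }
    digest = hashlib.sha256(_canonical_bytes(record)).hexdigest()
    return {**record, "sha256": digest}


def verify_result(sealed: dict) -> bool:
    """Recompute the digest over everything except the stored hash."""
    record = {k: v for k, v in sealed.items() if k != "sha256"}
    return hashlib.sha256(_canonical_bytes(record)).hexdigest() == sealed["sha256"]


sealed = seal_result(
    "demo-env-v1", 7, {"t1": 1.0, "t2": 0.0}, timestamp="2024-01-01T00:00:00+00:00"
)
tampered = {**sealed, "seed": 8}
print(verify_result(sealed), verify_result(tampered))
```

Anyone holding the record can recompute the digest, and rerunning the harness with the recorded seed should reproduce the scores that the digest covers.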
A bound, signed document. Plate I below is a sample from a recent evaluation. Real reports include per-task attribution, rubric traces, and raw measurement tables.

Every evaluation produces a calibrated scorecard, a task-indexed failure dossier, and a rubric trace — the same signal you can then feed straight back into training.
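To make the idea of feeding a rubric back into training concrete, here is a deliberately tiny sketch: a softmax policy over three candidate answers is updated with a REINFORCE-style rule, using the rubric score as the reward. Everything here (the `rubric`, `softmax`, and `train` functions, the candidates, the learning rate) is a hypothetical toy, not a description of how any real model is trained against these environments.

```python
import math
import random
from typing import Dict, List


def rubric(answer: str) -> float:
    # Assumed toy rubric: the correct answer to the task is "4".
    return 1.0 if answer == "4" else 0.0


def softmax(prefs: Dict[str, float]) -> Dict[str, float]:
    # Subtract the max preference for numerical stability.
    m = max(prefs.values())
    exps = {k: math.exp(v - m) for k, v in prefs.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def train(candidates: List[str], steps: int = 200, lr: float = 0.5, seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    prefs = {c: 0.0 for c in candidates}
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        reward = rubric(action)  # the rubric score is the training signal
        for c in prefs:
            indicator = 1.0 if c == action else 0.0
            # Policy-gradient step: raise the preference of rewarded actions.
            prefs[c] += lr * reward * (indicator - probs[c])
    return prefs


prefs = train(["3", "4", "5"])
best = max(prefs, key=prefs.get)
print(best)
```

The point of the sketch is only that the score used to grade a rollout and the reward used to update the policy are the same function, so improvements in training are measured on the same scale as the evaluation.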