When your task can’t be expressed as a prompt (agents, multi-step workflows, custom tooling, or heavy dependencies), connect your code to a playground. The iteration workflow stays the same: run evaluations, compare results side-by-side, and share with teammates. Your code handles task execution. The playground handles the rest.

Two approaches differ in where your code runs:
Remote evals — Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.
Sandboxes — Run evals in an isolated cloud sandbox, controlled from Braintrust. You push an execution artifact (a code bundle or container snapshot) and Braintrust invokes it on demand from the playground. No server to keep running.
Sandboxes are in beta and require a Pro or Enterprise plan. Self-hosted deployments require data plane version v2.0 (upcoming).
Your eval needs to call internal APIs, query private databases, or access services inside your VPN. Because remote evals execute on your infrastructure, that access is already available.
OS-specific or platform-locked tooling
Your eval requires software that only runs on a specific OS or machine — for example, a Windows-only simulation or a Unity project on a dedicated workstation. Remote evals let Braintrust trigger execution on whichever machine has the right environment set up.
Heavy or complex dev setup
Some tools are too painful to install on every teammate’s machine — game engines, large models, specialized SDKs. Set up the environment once on a shared server and let everyone else run the eval from the playground.
Data security and compliance
Sensitive data stays on your infrastructure. Only results are sent to Braintrust.
No server to maintain
Push your eval once and it’s always available from the playground — without keeping a process alive or worrying about uptime. This works well for stable eval versions the whole team can run on demand.
Team sharing without dev setup
An engineer packages the eval and pushes it. Teammates run it from the playground without cloning the repo, installing dependencies, or knowing anything about the execution environment.
Custom Python or TypeScript environments
Include pip packages with --requirements (Lambda) or bring your own container image (Modal) for full control over the runtime environment.
Reproducible, isolated runs
Each run executes against the same packaged artifact — same bundle or container snapshot — so results are consistent across teammates and over time.
Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.
A remote eval looks like a standard eval call with a parameters field that defines configurable options. These parameters become UI controls in the playground.

Install the SDK and dependencies:
To reference saved parameter configurations instead of defining them inline, use loadParameters() (TypeScript) or load_parameters() (Python). See Parameters for details.
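As a sketch, a minimal remote eval with one scalar parameter might look like the following (the project name, data, and `prefix` parameter are hypothetical; the structure mirrors a standard `Eval` call and requires the Braintrust SDK to run):

```typescript
// my_remote.eval.ts — hypothetical minimal remote eval.
import { Eval } from "braintrust";
import { z } from "zod";

Eval("my-project", {
  data: [{ input: "hello", expected: "HELLO" }],
  // `parameters` receives the values chosen in the playground UI.
  task: async (input: string, { parameters }) => {
    return `${parameters.prefix}${input.toUpperCase()}`;
  },
  scores: [],
  // Each entry here becomes a UI control in the playground.
  parameters: {
    prefix: z.string().describe("Prefix to prepend to the output").default(""),
  },
});
```

Running this file in dev mode exposes `prefix` as an editable field in the playground.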
Run your eval with the --dev flag to start a local server:
TypeScript
Python
Java
Ruby
```bash
npx braintrust eval path/to/eval.ts --dev
```
Dev server starts at http://localhost:8300. Configure the host and port:
--dev-host DEV_HOST: The host to bind to. Defaults to localhost. Set to 0.0.0.0 to bind to all interfaces (be cautious about security when exposing beyond localhost).
--dev-port DEV_PORT: The port to bind to. Defaults to 8300.
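For example, to bind the dev server to all interfaces on a non-default port (the eval path and port are placeholders):

```bash
# Expose the dev server beyond localhost on port 8400 — use with care.
npx braintrust eval path/to/eval.ts --dev --dev-host 0.0.0.0 --dev-port 8400
```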
```bash
braintrust eval path/to/eval.py --dev
```
Dev server starts at http://localhost:8300. Configure the host and port:
--dev-host DEV_HOST: The host to bind to. Defaults to localhost. Set to 0.0.0.0 to bind to all interfaces (be cautious about security when exposing beyond localhost).
--dev-port DEV_PORT: The port to bind to. Defaults to 8300.
The Java SDK does not have a CLI command. Start the dev server programmatically using Devserver.builder()...build() followed by devserver.start(), as shown in the code example above.
This creates config/initializers/braintrust_server.rb with a slug-to-evaluator mapping auto-discovered from app/evaluators/.
Mount the engine:
```ruby
# config/routes.rb
Rails.application.routes.draw do
  mount Braintrust::Contrib::Rails::Server::Engine, at: "/braintrust"
end
```
Auth configuration
The engine defaults to :clerk_token authentication. For local development, set auth to :none in the generated initializer:
```ruby
# config/initializers/braintrust_server.rb
Braintrust::Contrib::Rails::Server::Engine.configure do |config|
  config.auth = :none
end
```
auth: :none disables authentication on incoming requests. Only use this for local development. BRAINTRUST_API_KEY must still be set on the server — it’s required to fetch resources from your project.
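In practice, that typically means exporting the key before booting Rails in development (the key value is a placeholder):

```bash
export BRAINTRUST_API_KEY=<your-api-key>
bin/rails server
```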
Sandboxes are in beta and require a Pro or Enterprise plan. Self-hosted deployments require data plane version v2.0 (upcoming).
Run evals in an isolated cloud sandbox, controlled from Braintrust. Push an execution artifact once and Braintrust invokes it on demand from the playground — no server to keep running.

Braintrust supports two sandbox providers:
Lambda — AWS Lambda-based. The default for braintrust push. Supports both Python and TypeScript. No extra configuration needed.
Modal — Container-based via Modal. Requires a snapshotted Modal container image. Executes TypeScript evals only.
A sandbox eval looks like a standard eval call with a parameters field that defines configurable options. These parameters become UI controls in the playground.

Install the SDK and dependencies:
Sandboxes require TypeScript SDK v3.7.1+ or Python SDK v0.12.1+.
Create the eval code:
my_eval.eval.ts
```typescript
import { Eval, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";

const client = wrapOpenAI(new OpenAI());

Eval("my-project", {
  data: [{ input: "hello", expected: "HELLO" }],
  task: async (input, { parameters }) => {
    const completion = await client.chat.completions.create(
      parameters.main.build({ input }),
    );
    return completion.choices[0].message.content ?? "";
  },
  scores: [],
  parameters: {
    main: {
      type: "prompt",
      name: "Main prompt",
      description: "The prompt used to process input",
      default: {
        messages: [{ role: "user", content: "{{input}}" }],
        model: "gpt-5-mini",
      },
    },
    prefix: z.string().describe("Optional prefix to prepend to input").default(""),
  },
});
```
The parameter system uses different syntax across languages:
| Feature | TypeScript | Python |
| --- | --- | --- |
| Prompt parameters | `type: "prompt"` with `messages` array in `default` | `type: "prompt"` with nested `prompt.messages` and `options` |
| Scalar types | Zod schemas: `z.string()`, `z.boolean()`, `z.number()` with `.describe()` | Pydantic models with `Field(description=...)` |
| Parameter access | `parameters.prefix` | `parameters.get("prefix")` |
| Prompt usage | `parameters.main.build({ input: value })` | `**parameters["main"].build(input=value)` |
| Async | `async`/`await` | `async`/`await` |
To reference saved parameter configurations instead of defining them inline, use loadParameters() (TypeScript) or load_parameters() (Python). See Parameters for details.
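Once the eval runs locally, push it so the playground can invoke it on demand. Hedged examples of the push step (exact flags may vary by SDK version; Lambda is the default provider, and `--requirements` applies to Python dependencies as noted above):

```bash
# TypeScript: bundle and push to the default Lambda sandbox.
npx braintrust push my_eval.eval.ts

# Python: include pip packages in the Lambda bundle.
braintrust push my_eval.py --requirements requirements.txt
```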