ReinforceNow

train.jsonl

Training data in JSON Lines format. Each line is one training example.
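Because every line is a standalone JSON document, the file can be checked before training by parsing it line by line. The following is a minimal sketch using only the Python standard library. The load_jsonl helper is hypothetical and not part of ReinforceNow, and skipping blank lines is an assumption of this sketch rather than documented behavior.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file and return one dict per non-blank line."""
    examples = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            try:
                examples.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so it is easy to locate in the file
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return examples
```

Calling load_jsonl("train.jsonl") returns the parsed examples, or raises a ValueError that names the first line that fails to parse.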

Fields

messages: Conversation array. Required.

rewards: Reward function names to evaluate. Required for RL.

metadata: Custom data passed to reward functions via args.metadata.

tools: Filter available tools for this task. If omitted, all tools are available.

variables: Template substitution using $variable syntax.

docker: Docker image for isolated sandbox execution. Required if using sandbox=True tools/rewards.

docker_env: Environment variables for the sandbox container (object).
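To illustrate the $variable syntax mentioned for the variables field, the sketch below uses Python's string.Template, which has the same placeholder style. Whether ReinforceNow uses this class internally, and the assumption that variables is an object mapping names to values applied to message content, are guesses made for illustration only. The render_messages helper is hypothetical.

```python
from string import Template


def render_messages(messages, variables):
    """Return a copy of messages with $name placeholders in content filled in."""
    return [
        # safe_substitute leaves unknown placeholders untouched instead of raising
        {**m, "content": Template(m["content"]).safe_substitute(variables)}
        for m in messages
    ]


# Assumed record shape: "variables" maps placeholder names to values
record = {
    "messages": [{"role": "user", "content": "What is $a + $b?"}],
    "variables": {"a": 2, "b": 3},
}
rendered = render_messages(record["messages"], record["variables"])
print(rendered[0]["content"])  # What is 2 + 3?
```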

Message Roles

system: System instructions (optional, appears first)

user: User message (at least one required)

assistant: Assistant response (for multi-turn context)
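The field and role rules above can be turned into a quick pre-flight check. This is a minimal sketch, not the validation ReinforceNow itself performs. The validate_example function and its require_rewards flag are hypothetical names, and the check covers only the rules stated in this section.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example, require_rewards=False):
    """Return a list of problems found in one train.jsonl record (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' is required and must be a non-empty array"]

    problems = []
    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    for i, role in enumerate(roles):
        if role not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: unknown role {role!r}")
    if "user" not in roles:
        problems.append("at least one user message is required")
    if "system" in roles[1:]:
        problems.append("a system message must appear first")
    # For RL runs, pass require_rewards=True to enforce the rewards field
    if require_rewards and not example.get("rewards"):
        problems.append("'rewards' is required for RL")
    return problems
```

An empty list means the record passed every check. Setting require_rewards=True applies the extra rule for RL data.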

Examples

RL

{"messages": [{"role": "user", "content": "What is 2+2?"}], "rewards": ["accuracy"], "metadata": {"answer": "4"}}
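The RL example names an accuracy reward and supplies the expected answer through metadata. A reward of that name might look roughly like the sketch below. Only args.metadata comes from the field description on this page. The function signature, the response argument, the 0.0/1.0 return convention, and the use of SimpleNamespace as a stand-in for the args object are all assumptions for illustration, so consult the reward function documentation for the real interface.

```python
from types import SimpleNamespace


def accuracy(args, response):
    """Hypothetical reward: 1.0 if the expected answer appears in the response."""
    expected = str(args.metadata["answer"]).strip()
    return 1.0 if expected and expected in response else 0.0


# Stand-in for the args object; only the metadata attribute is modeled here
args = SimpleNamespace(metadata={"answer": "4"})
print(accuracy(args, "2+2 equals 4"))   # 1.0
print(accuracy(args, "I am not sure"))  # 0.0
```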

SFT

{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]}

With Tools

{"messages": [{"role": "user", "content": "Search for AI news"}], "rewards": ["quality"], "tools": ["search"]}

With Sandbox

Entries that use sandbox=True tools or rewards must set docker:

{"messages": [{"role": "user", "content": "Write a Python script that creates hello.txt"}], "rewards": ["file_created"], "tools": ["run_python"], "docker": "python:3.11-slim"}

With Environment Variables

Use docker_env to pass configuration to the container. The example is pretty-printed for readability; in train.jsonl it must occupy a single line:

{
  "messages": [{"role": "user", "content": "Query the database"}],
  "rewards": ["correct_result"],
  "docker": "myorg/db-tools:latest",
  "docker_env": {"DB_HOST": "localhost", "DEBUG": "true"}
}
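Since train.jsonl holds one example per line, a record spread over several lines like the one above has to be collapsed before it is added to the file. A minimal sketch with the standard json module follows. The record is copied from the example and no ReinforceNow API is involved.

```python
import json

# The multi-line record from the example above
pretty = """
{
  "messages": [{"role": "user", "content": "Query the database"}],
  "rewards": ["correct_result"],
  "docker": "myorg/db-tools:latest",
  "docker_env": {"DB_HOST": "localhost", "DEBUG": "true"}
}
"""

# json.dumps without an indent argument emits the whole object on one line
line = json.dumps(json.loads(pretty))
print(line)
```

The resulting string contains no newline characters and can be appended to train.jsonl followed by a single newline.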

Custom Docker images must be built for linux/amd64 because sandboxes run on x86_64 Linux servers. On ARM Macs, docker build targets linux/arm64 by default, so images built without the --platform flag will fail in the sandbox:

# Correct
docker build --platform linux/amd64 -t myorg/myimage:latest .

# Wrong (will fail)
docker build -t myorg/myimage:latest .
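To confirm which platform a local image was built for before referencing it in train.jsonl, the image can be inspected with the Docker CLI. The sketch below wraps docker image inspect in Python. The helper names image_platform and is_sandbox_compatible are hypothetical, and the first one assumes the docker CLI is installed and the image exists locally.

```python
import subprocess


def image_platform(image):
    """Return 'os/arch' for a local image, for example 'linux/amd64'."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Os}}/{{.Architecture}}", image],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the image is not found
    )
    return result.stdout.strip()


def is_sandbox_compatible(platform):
    """True when the platform string matches what the sandboxes require."""
    return platform == "linux/amd64"
```

For instance, is_sandbox_compatible(image_platform("myorg/myimage:latest")) returns False for an image built on an ARM Mac without the platform flag.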