ReinforceNow

config.yml

Reference for config.yml project configuration.

We currently support two types of experiments: Reinforcement Learning and Supervised Fine-Tuning.

Templates such as sft, rl-single, and rl-multi are also available through the CLI. To initialize a project using one of these templates, run:

rnow init --template <name>
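
For example, to scaffold a single-turn RL project from the rl-single template:

rnow init --template rl-single

The generated config.yml should follow the same structure as the full examples below.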

We provide full example configuration files for both experiment types:

Full Example (RL)

project_id: ""
project_name: "My RL Project"
dataset_id: ""
dataset_type: rl
organization_id: ""
data:
  train_file: train.jsonl
  batch_size: 2
  group_size: 16
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
  qlora_alpha: 64
algorithm:
  loss_fn: ppo
  adv_estimator: grpo
  kl_penalty_coef: 0.01
rollout:
  max_turns: 1
  max_tokens: 2048
  termination_policy: last_tool
trainer:
  num_epochs: 30
  learning_rate: 0.0001
  save_step: 20
  eval_step: 0

Full Example (SFT)

project_id: ""
project_name: "My SFT Project"
dataset_id: ""
dataset_type: sft
organization_id: ""
data:
  train_file: train.jsonl
  batch_size: 2
  val_split: 0.2
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
trainer:
  num_epochs: 40
  learning_rate: 0.0001
  save_step: 2
  eval_step: 3

Project Fields

project_id: Unique project identifier (auto-generated). Required.

project_name: Human-readable project name. Required.

dataset_id: Unique dataset identifier (auto-generated). Required.

dataset_type: Type of experiment. Must be rl or sft. Required.

organization_id: Organization ID. Uses active org if not set.

data

train_file: Path to training data file. Default: train.jsonl

batch_size: Training batch size. Max: 32. Required.

group_size: Number of parallel rollouts per prompt. Default: 4, Max: 64. RL only.

Note: batch_size * group_size must be <= 2048 (sandbox concurrency limit).
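
For example, the data block below sits exactly at that limit (32 * 64 = 2048); raising either value further would exceed it:

data:
  train_file: train.jsonl
  batch_size: 32
  group_size: 64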

val_split: Validation split ratio (0-1). SFT only.

model

path: Model to train. Must be from the supported models list, or a checkpoint ID from a previous training run to continue training. Required.

Multi-model training: You can provide a list of models to train them all with the same configuration. The CLI will submit separate runs for each model:

path:
  - Qwen/Qwen3-4B-Instruct-2507
  - Qwen/Qwen3-8B
  - Qwen/Qwen3-30B-A3B

qlora_rank: QLoRA rank for efficient fine-tuning. Default: 32

Maximum LoRA rank by model:

  • 32: openai/gpt-oss-120b, openai/gpt-oss-20b
  • 64: Qwen/Qwen3-235B-A22B-Instruct-2507, Qwen/Qwen3-30B-A3B, deepseek-ai/DeepSeek-V3.1
  • 128: Qwen/Qwen3-32B, Qwen/Qwen3-8B*, Qwen/Qwen3-4B-Instruct-2507, all meta-llama/* models

qlora_alpha: QLoRA alpha scaling factor. Default: qlora_rank * 2
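
For example, with a rank of 32 the alpha defaults to 64, so setting it explicitly is only needed to override that scaling:

model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32      # alpha defaults to 64 (qlora_rank * 2)
  qlora_alpha: 64     # explicit value; omit to accept the default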

Supported Models: Qwen/Qwen3-235B-A22B-Instruct-2507, Qwen/Qwen3-30B-A3B-Instruct-2507, Qwen/Qwen3-30B-A3B, Qwen/Qwen3-30B-A3B-Base, Qwen/Qwen3-32B, Qwen/Qwen3-8B, Qwen/Qwen3-8B-Base, Qwen/Qwen3-4B-Instruct-2507, meta-llama/Llama-3.1-70B, meta-llama/Llama-3.3-70B-Instruct, meta-llama/Llama-3.1-8B, meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-1B, deepseek-ai/DeepSeek-V3.1, deepseek-ai/DeepSeek-V3.1-Base, openai/gpt-oss-120b, openai/gpt-oss-20b

You can also use a checkpoint ID from a previous training run to continue training from that checkpoint.
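
To continue training, put the checkpoint ID in model.path instead of a model name. The value below is a placeholder; use the checkpoint ID reported by your earlier run:

model:
  path: <checkpoint-id>   # placeholder: replace with a real checkpoint ID
  qlora_rank: 32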

algorithm

RL only.

loss_fn: Loss function. Default: ppo

Possible values:

  • ppo: Proximal Policy Optimization
  • importance_sampling: Importance Sampling

adv_estimator: Advantage estimator. Default: grpo

Possible values:

  • grpo: Group Relative Policy Optimization (recommended)
  • gae: Generalized Advantage Estimation
  • reinforce: Classic REINFORCE algorithm

kl_penalty_coef: KL divergence penalty coefficient. Default: 0.01

rollout

RL only.

max_turns: Maximum conversation turns. Default: 1

max_tokens: Maximum tokens per model response. Default: 2048

Important: This value must fit within the model's context window along with your prompt. The CLI validates that max_prompt_tokens + max_tokens < context_window.

For reasoning models: Models that use <think> tags consume many tokens for chain-of-thought reasoning. Set max_tokens to 8192 or higher (e.g., 16384) to allow sufficient space for reasoning.
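
As a sketch, a rollout block for a reasoning model might raise the token budget like this (assuming the model's context window fits your prompt plus this value):

rollout:
  max_turns: 1
  max_tokens: 16384   # extra headroom for <think> chain-of-thought tokens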

termination_policy: When to end an episode. Default: last_tool

Possible values:

  • last_tool: Episode ends when model responds without a tool call
  • max_turns: Episode always runs for exactly max_turns turns

Notes:

  • A turn is one model generation. Each assistant response (with any tool calls and results) counts as one turn.
  • Pre-existing messages in train.jsonl do not count toward max_turns.

thinking_mode: Chain-of-thought mode. Optional.

Possible values:

  • disabled: Explicitly disable reasoning
  • easy: Light reasoning
  • medium: Moderate reasoning
  • hard: Deep reasoning

When not specified, reasoning is automatically enabled for supported models. Set to disabled to turn it off.
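
For example, to explicitly disable reasoning (thinking_mode sits under rollout like the other fields in this section):

rollout:
  thinking_mode: disabled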

max_tool_response_chars: Maximum characters for tool responses. Default: 4000

Longer responses are truncated. Set to null to disable truncation.
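
For example, to disable truncation entirely:

rollout:
  max_tool_response_chars: null   # tool responses are passed through untruncated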

mcp_url: MCP server URL(s). Optional.

Connect to MCP servers to provide tools to your agent. Can be a single URL or an array of URLs:
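
A sketch of both forms, with placeholder URLs standing in for your own MCP servers:

rollout:
  # a single MCP server
  mcp_url: https://example.com/mcp

or, for several servers:

rollout:
  mcp_url:
    - https://example.com/mcp
    - https://tools.example.com/mcp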

Note: You can use both mcp_url and env.py together—tools from both sources will be available to your agent.

See the MCP Tutorial for setup instructions.

trainer

num_epochs: Number of training epochs. Required.

learning_rate: Learning rate. Default: 0.0001

save_step: Save checkpoint every N steps. 0 = end of epoch only. Default: 0

eval_step: Evaluate every N steps. 0 = end of epoch only. Default: 0