config.yml
Reference for config.yml project configuration.
We currently support two types of experiments: Reinforcement Learning (RL) and Supervised Finetuning (SFT).
Templates such as sft, rl-single, and rl-multi are also available through the CLI. To initialize a project using one of these templates, run:
rnow init --template <name>
We provide template configuration files for both:
Full Example (RL)
project_id: ""
project_name: "My RL Project"
dataset_id: ""
dataset_type: rl
organization_id: ""
data:
  train_file: train.jsonl
  batch_size: 2
  group_size: 16
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
  qlora_alpha: 64
algorithm:
  loss_fn: ppo
  adv_estimator: grpo
  kl_penalty_coef: 0.01
rollout:
  max_turns: 1
  max_context_window: 2048
  termination_policy: last_tool
trainer:
  num_epochs: 30
  learning_rate: 0.0001
  save_step: 20
# Optional: Run-dependent evals
evals:
  - eval_id: your_eval_id
    step: 100
Full Example (SFT)
project_id: ""
project_name: "My SFT Project"
dataset_id: ""
dataset_type: sft
organization_id: ""
data:
  train_file: train.jsonl
  batch_size: 2
  val_split: 0.2
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
trainer:
  num_epochs: 40
  learning_rate: 0.0001
  save_step: 2
Project Fields
project_id: Unique project identifier (auto-generated). Required.
project_name: Human-readable project name. Required.
dataset_id: Unique dataset identifier (auto-generated). Required.
dataset_type: Type of experiment. Must be rl or sft. Required.
organization_id: Organization ID. Uses active org if not set.
data
train_file: Path to training data file. Default: train.jsonl
batch_size: Training batch size. Max: 32. Required.
group_size: Number of parallel rollouts per prompt. Default: 4, Max: 64. RL only.
Note: batch_size * group_size must be <= 2048 (sandbox concurrency limit).
val_split: Validation split ratio (0-1). SFT only.
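For reference, a minimal data block for an RL run that respects these limits might look like the following (values are illustrative, not recommendations):
data:
  train_file: train.jsonl
  batch_size: 8     # must be <= 32
  group_size: 16    # 8 * 16 = 128 concurrent rollouts, well under the 2048 sandbox limit
For an SFT run, drop group_size and set val_split instead (e.g. val_split: 0.2).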
model
path: Model to train. Must be from the supported models list, or a checkpoint ID from a previous training run to continue training. Required.
Multi-model training: You can provide a list of models to train them all with the same configuration. The CLI will submit separate runs for each model:
path:
  - Qwen/Qwen3-4B-Instruct-2507
  - Qwen/Qwen3-8B
  - Qwen/Qwen3-30B-A3B
qlora_rank: QLoRA rank for efficient finetuning. Default: 32
Maximum LoRA rank by model:
- 32: openai/gpt-oss-120b, openai/gpt-oss-20b
- 64: Qwen/Qwen3-235B-A22B-Instruct-2507, Qwen/Qwen3-30B-A3B*, deepseek-ai/DeepSeek-V3.1
- 128: Qwen/Qwen3-32B, Qwen/Qwen3-8B*, Qwen/Qwen3-4B-Instruct-2507, all meta-llama/* models
qlora_alpha: QLoRA alpha scaling factor. Default: qlora_rank * 2
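As a sketch, a model block that uses the largest rank allowed for Qwen/Qwen3-8B (128, per the table above) and writes the default alpha out explicitly:
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 128    # maximum for Qwen/Qwen3-8B per the table above
  qlora_alpha: 256   # optional; defaults to qlora_rank * 2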
Supported Models: Qwen/Qwen3-235B-A22B-Instruct-2507, Qwen/Qwen3-30B-A3B-Instruct-2507, Qwen/Qwen3-30B-A3B, Qwen/Qwen3-30B-A3B-Base, Qwen/Qwen3-32B, Qwen/Qwen3-8B, Qwen/Qwen3-8B-Base, Qwen/Qwen3-4B-Instruct-2507, meta-llama/Llama-3.1-70B, meta-llama/Llama-3.3-70B-Instruct, meta-llama/Llama-3.1-8B, meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-1B, deepseek-ai/DeepSeek-V3.1, deepseek-ai/DeepSeek-V3.1-Base, openai/gpt-oss-120b, openai/gpt-oss-20b
You can also use a checkpoint ID from a previous training run to continue training from that checkpoint.
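For example, to resume from an earlier run, point path at that run's checkpoint ID instead of a base model (the value below is a placeholder, not a real ID):
model:
  path: <checkpoint-id-from-a-previous-run>   # placeholder; substitute your checkpoint ID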
algorithm
RL only.
loss_fn: Loss function. Default: ppo
Possible values:
- ppo: Proximal Policy Optimization
- importance_sampling: Importance Sampling
adv_estimator: Advantage estimator. Default: grpo
Possible values:
- grpo: Group Relative Policy Optimization (recommended)
- gae: Generalized Advantage Estimation
- reinforce: Classic REINFORCE algorithm
kl_penalty_coef: KL divergence penalty coefficient. Default: 0.01
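Written out explicitly, an algorithm block using the defaults described above looks like this (the comments list the alternative values):
algorithm:
  loss_fn: ppo             # or importance_sampling
  adv_estimator: grpo      # or gae / reinforce
  kl_penalty_coef: 0.01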
rollout
RL only.
max_turns: Maximum conversation turns. Default: 1
max_tokens: Maximum tokens per model response. Default: 2048
Important: This value must fit within the model's context window along with your prompt. The CLI validates that max_prompt_tokens + max_tokens < context_window.
For reasoning models: Models that use <think> tags consume many tokens for chain-of-thought reasoning. Set max_tokens to 8192 or higher (e.g., 16384) to allow sufficient space for reasoning.
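For instance, a rollout block sized for a reasoning model might look like the sketch below (token budgets are illustrative; make sure the total still fits the model's context window):
rollout:
  max_tokens: 16384          # generous budget for <think> chain-of-thought
  max_context_window: 32768  # default context window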
termination_policy: When to end episode. Default: last_tool
Possible values:
- last_tool: Episode ends when model responds without a tool call
- max_turns: Episode always runs for exactly max_turns turns
Notes:
- A turn is one model generation. Each assistant response (with any tool calls and results) counts as one turn.
- Pre-existing messages in train.jsonl do not count toward max_turns.
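As an example, a multi-turn tool-use rollout that always runs a fixed number of turns could be configured as follows (the turn count is illustrative):
rollout:
  max_turns: 4
  termination_policy: max_turns   # every episode runs exactly 4 turns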
reasoning_mode: Chain-of-thought mode. Optional.
Possible values:
- disabled: Explicitly disable reasoning
- low: Light reasoning
- medium: Moderate reasoning
- high: Deep reasoning
When not specified, reasoning is automatically enabled for supported models. Set to disabled to turn it off.
max_context_window: Maximum context window in tokens. Default: 32768
Tool results are automatically truncated to fit within the context window. If a tool result would exceed the available space, it is truncated and the rollout ends.
include_thinking: Preserve <think> blocks in conversation history. Default: false (stripped from history).
Note: This only affects what the model sees on subsequent turns. Traces always include thinking for debugging.
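If the model should see its own reasoning on later turns, a sketch like this keeps <think> blocks in history (values illustrative):
rollout:
  include_thinking: true      # keep <think> blocks visible on subsequent turns
  max_context_window: 32768   # default; tool results are still truncated to fit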
mcp_url: MCP server URL(s). Optional.
Connect to MCP servers to provide tools to your agent. Can be a single URL or an array of URLs:
- Single server:
  mcp_url: "https://mcp.tavily.com/mcp/?tavilyApiKey=..."
- Multiple servers:
  mcp_url: ["https://mcp.tavily.com/...", "https://mcp.exa.ai/..."]
Note: You can use both mcp_url and tools.py together—tools from both sources will be available to your agent.
See the MCP Tutorial for setup instructions.
trainer
num_epochs: Number of training epochs. Required.
learning_rate: Learning rate. Default: 0.0001
save_step: Checkpoint frequency. -1 = save only at the end of training; N = save every N steps. Default: -1
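For example, a trainer block that checkpoints periodically rather than only at the end (values illustrative):
trainer:
  num_epochs: 30
  learning_rate: 0.0001
  save_step: 20   # checkpoint every 20 steps; -1 would save only at the end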
evals
Top-level section for run-dependent evaluations that trigger during training.
eval_id: Eval ID from rnow eval. Copy from the eval detail page. Required.
step: Run eval every N training steps. Required.
Example:
evals:
  - eval_id: cmla1l13e000004jwxu39jrpy
    step: 100 # Run every 100 steps
How it works: First create an eval with rnow eval, then reference its ID here. Evals run asynchronously in separate sandboxes and don't block training. Pass@k scores appear in your training graphs under the "Evaluation" section.