config.yml
Reference for config.yml project configuration.
We currently support two types of experiments: Reinforcement Learning and Supervised Fine-Tuning.
Templates such as sft, rl-single, and rl-multi are also available through the CLI. To initialize a project using one of these templates, run:
rnow init --template <name>
We provide template configuration files for both experiment types:
Full Example (RL)
project_id: ""
project_name: "My RL Project"
dataset_id: ""
dataset_type: rl
organization_id: ""
data:
  train_file: train.jsonl
  batch_size: 2
  group_size: 16
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
  qlora_alpha: 64
algorithm:
  loss_fn: ppo
  adv_estimator: grpo
  kl_penalty_coef: 0.01
rollout:
  max_turns: 1
  max_tokens: 2048
  termination_policy: last_tool
trainer:
  num_epochs: 30
  learning_rate: 0.0001
  save_step: 20
  eval_step: 0
Full Example (SFT)
project_id: ""
project_name: "My SFT Project"
dataset_id: ""
dataset_type: sft
organization_id: ""
data:
  train_file: train.jsonl
  batch_size: 2
  val_split: 0.2
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
trainer:
  num_epochs: 40
  learning_rate: 0.0001
  save_step: 2
  eval_step: 3
Project Fields
project_id: Unique project identifier (auto-generated). Required.
project_name: Human-readable project name. Required.
dataset_id: Unique dataset identifier (auto-generated). Required.
dataset_type: Type of experiment. Must be rl or sft. Required.
organization_id: Organization ID. Uses active org if not set.
data
train_file: Path to training data file. Default: train.jsonl
batch_size: Training batch size. Max: 32. Required.
group_size: Number of parallel rollouts per prompt. Default: 4, Max: 64. RL only.
Note: batch_size * group_size must be <= 2048 (sandbox concurrency limit).
val_split: Validation split ratio (0-1). SFT only.
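To illustrate the concurrency limit above, here is a data block that uses the documented maximums (batch_size 32, group_size 64), which lands exactly at the cap since 32 * 64 = 2048; the values are illustrative, not recommendations:
data:
  train_file: train.jsonl
  batch_size: 32   # documented maximum batch size
  group_size: 64   # documented maximum group size; 32 * 64 = 2048, at the sandbox limit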
model
path: Model to train. Must be from the supported models list, or a checkpoint ID from a previous training run to continue training. Required.
Multi-model training: You can provide a list of models to train them all with the same configuration. The CLI will submit separate runs for each model:
path:
  - Qwen/Qwen3-4B-Instruct-2507
  - Qwen/Qwen3-8B
  - Qwen/Qwen3-30B-A3B
qlora_rank: QLoRA rank for efficient fine-tuning. Default: 32
Maximum LoRA rank by model:
- 32: openai/gpt-oss-120b, openai/gpt-oss-20b
- 64: Qwen/Qwen3-235B-A22B-Instruct-2507, Qwen/Qwen3-30B-A3B*, deepseek-ai/DeepSeek-V3.1
- 128: Qwen/Qwen3-32B, Qwen/Qwen3-8B*, Qwen/Qwen3-4B-Instruct-2507, all meta-llama/* models
qlora_alpha: QLoRA alpha scaling factor. Default: qlora_rank * 2
Supported Models: Qwen/Qwen3-235B-A22B-Instruct-2507, Qwen/Qwen3-30B-A3B-Instruct-2507, Qwen/Qwen3-30B-A3B, Qwen/Qwen3-30B-A3B-Base, Qwen/Qwen3-32B, Qwen/Qwen3-8B, Qwen/Qwen3-8B-Base, Qwen/Qwen3-4B-Instruct-2507, meta-llama/Llama-3.1-70B, meta-llama/Llama-3.3-70B-Instruct, meta-llama/Llama-3.1-8B, meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-1B, deepseek-ai/DeepSeek-V3.1, deepseek-ai/DeepSeek-V3.1-Base, openai/gpt-oss-120b, openai/gpt-oss-20b
You can also use a checkpoint ID from a previous training run to continue training from that checkpoint.
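To make the rank and alpha interaction concrete, here is a sketch of a model block that sets the rank explicitly and relies on the default alpha scaling (qlora_alpha = qlora_rank * 2); the rank of 64 is illustrative and stays within the 128 cap listed above for Qwen/Qwen3-8B:
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 64   # illustrative; within the 128 cap for Qwen/Qwen3-8B
  # qlora_alpha omitted: defaults to qlora_rank * 2, i.e. 128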
algorithm
RL only.
loss_fn: Loss function. Default: ppo
Possible values:
- ppo: Proximal Policy Optimization
- importance_sampling: Importance Sampling
adv_estimator: Advantage estimator. Default: grpo
Possible values:
- grpo: Group Relative Policy Optimization (recommended)
- gae: Generalized Advantage Estimation
- reinforce: Classic REINFORCE algorithm
kl_penalty_coef: KL divergence penalty coefficient. Default: 0.01
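As a sketch of the non-default choices, an algorithm block that switches the advantage estimator to GAE and raises the KL penalty might look like this; the kl_penalty_coef value is purely illustrative:
algorithm:
  loss_fn: ppo            # default loss function
  adv_estimator: gae      # instead of the recommended grpo
  kl_penalty_coef: 0.05   # illustrative; default is 0.01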
rollout
RL only.
max_turns: Maximum conversation turns. Default: 1
max_tokens: Maximum tokens per model response. Default: 2048
Important: This value must fit within the model's context window along with your prompt. The CLI validates that max_prompt_tokens + max_tokens < context_window.
For reasoning models: Models that use <think> tags consume many tokens for chain-of-thought reasoning. Set max_tokens to 8192 or higher (e.g., 16384) to allow sufficient space for reasoning.
termination_policy: When to end episode. Default: last_tool
Possible values:
- last_tool: Episode ends when model responds without a tool call
- max_turns: Episode always runs for exactly max_turns turns
Notes:
- A turn is one model generation. Each assistant response (with any tool calls and results) counts as one turn.
- Pre-existing messages in train.jsonl do not count toward max_turns.
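Putting the turn and token guidance together, a multi-turn, tool-using rollout for a reasoning model might be sketched as below; max_turns: 8 is an illustrative choice, and 16384 follows the reasoning-model suggestion above:
rollout:
  max_turns: 8                    # illustrative; each model generation counts as one turn
  max_tokens: 16384               # larger budget to leave room for <think> reasoning
  termination_policy: last_tool   # end the episode when the model replies without a tool call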
thinking_mode: Chain-of-thought mode. Optional.
Possible values:
- disabled: Explicitly disable reasoning
- easy: Light reasoning
- medium: Moderate reasoning
- hard: Deep reasoning
When not specified, reasoning is automatically enabled for supported models. Set to disabled to turn it off.
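For example, to force a non-reasoning run on a model that would otherwise enable reasoning automatically, a minimal sketch is:
rollout:
  thinking_mode: disabled   # reasoning is otherwise enabled automatically for supported models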
max_tool_response_chars: Maximum characters for tool responses. Default: 4000
Longer responses are truncated. Set to null to disable truncation.
mcp_url: MCP server URL(s). Optional.
Connect to MCP servers to provide tools to your agent. Can be a single URL or an array of URLs:
- Single server:
  mcp_url: "https://mcp.tavily.com/mcp/?tavilyApiKey=..."
- Multiple servers:
  mcp_url: ["https://mcp.tavily.com/...", "https://mcp.exa.ai/..."]
Note: You can use both mcp_url and env.py together—tools from both sources will be available to your agent.
See the MCP Tutorial for setup instructions.
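Combining the tool-related options above, a rollout block that connects to two MCP servers and disables tool-response truncation could look like the following sketch (URLs abbreviated as in the examples above):
rollout:
  mcp_url: ["https://mcp.tavily.com/...", "https://mcp.exa.ai/..."]
  max_tool_response_chars: null   # disable truncation; default is 4000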
trainer
num_epochs: Number of training epochs. Required.
learning_rate: Learning rate. Default: 0.0001
save_step: Save checkpoint every N steps. 0 = end of epoch only. Default: 0
eval_step: Evaluate every N steps. 0 = end of epoch only. Default: 0
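As a final sketch, a trainer block that saves and evaluates only at the end of each epoch (the 0 = end-of-epoch behaviour described above) might be:
trainer:
  num_epochs: 10          # illustrative; this field is required
  learning_rate: 0.0001   # default
  save_step: 0            # 0 = checkpoint at end of epoch only
  eval_step: 0            # 0 = evaluate at end of epoch only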